Abstract

Given a set of random variables, it is often the case that their associations can be explained by
hidden common causes. We present a set of well-defined assumptions and a provably correct
algorithm that allow us to identify some of these hidden common causes. The assumptions
are fairly general and sometimes weaker than those used in practice by, for instance,
econometricians, psychometricians, social scientists and researchers in many other fields where
latent variable models are important and tools such as factor analysis are applicable. The goal
is automated knowledge discovery: identifying latent variables that can be used across different
applications and causal models and that shed new light on a data generating process. Our
approach is evaluated through simulations and three real-world cases.

1 Introduction


Latent variables are everywhere in science. Concepts such as gravitational fields, subatomic
particles  or  various  classes  of  antibodies  are  essential  building  blocks  of  models  of  great 
practical  impact,  and  yet  such  entities  are  unobservable.  Sometimes  there  is  overwhelming 
evidence  that  hidden  variables  are  actual  physical  entities,  and  sometimes  they  are  useful 
abstractions  to  be  added  to  the  scientific  vocabulary  to  make  the  description  of  Nature  more 
tractable. 

For instance, focusing on our particular interest, artificial intelligence (AI), it is hard
to conceive of a robot or any kind of intelligent agent seamlessly integrated into its environment if
such an agent is not able to reason with latent variables. Consider a futuristic version of Pearl,
the nursing robot described in (Pineau et al., 2003). Imagine the task of autonomously
attempting to diagnose and reduce the stress or depression levels of a patient, considering that
someone suffering from depression will in many cases not ask for help. If the robot detects
that the patient is feeling too stressed, it could remotely contact healthcare professionals to come
over and properly treat the patient. This would be especially useful if he or she is an elderly
person living alone.

However, “stress” is not an easily describable concept: unlike “height” and “weight”,
there is no simple scale for it. Instead, one can measure stress through a varied set of
indicators such as blood pressure, number of hours slept per day, cold and sweaty hands, and
so on. By using such indicators obtained from physical sensors, an agent is able to reason
about a latent concept and make the proper intervention in the world. Either way, latent
variables play a major role in the process of scientific modeling and discovery, and any tool
that could aid the discovery of latent variables would be of great interest.

This  is  the  goal  of  this  paper.  We  introduce  a  machine  learning  algorithm  to  discover 
possible  hidden  common  causes  of  a  set  of  observed  variables  in  a  causal  graphical  model 
framework.  Unlike  factor  analysis,  there  is  no  need  to  rely  on  arbitrary  rotations  of  the  latent 
space.  Unlike  general  hill-climbing  algorithms  over  directed  acyclic  graphs  (DAGs)  with 
latent  variables,  our  approach  provides  an  equivalence  class  of  models  that  are  empirically 
indistinguishable. Moreover, a proof of consistency of the algorithm is given in the limit of
infinite  data.  That  is,  given  the  constraints  that  hold  in  the  population  over  the  measured 
variables,  and  a  set  of  assumptions  we  make  explicit  in  Section  3  below,  the  algorithm  will 
output  an  equivalence  class  that  includes  the  correct  latent  variable  measurement  model. 

Our  assumptions  are  described  in  detail.  The  most  important  assumption  is  that  observed 
variables  are  measures  of  a  set  of  unknown  latent  variables.  In  graphical  model  terminology, 
it  means  that  no  observed  variable  is  an  ancestor  of  a  latent  variable,  but  direct  connections 
among  observed  variables  are  allowed.  A  stronger  variation  of  this  assumption  is  widely  used 
in  other  latent  variable  discovery  methods  such  as  exploratory  factor  analysis.  The  graphical 
structure of the latent nodes is free to take any form: an arbitrary DAG, a DAG with other
hidden common causes, or a cyclic graph.

In this work, we will not discuss how to learn the structure among latent variables.
Instead, we will provide an algorithm to learn a graphical structure describing which latent
variables are parents of which observed variables, i.e., a measurement model. The procedure
is an exploratory data analysis, or data mining, method to discover latent concepts that can
be useful for AI applications and in a variety of scientific models. The measurement model
obtained can then be fixed so that another learning procedure can be applied to search for a
structure among latents, as done in Silva (2002). That problem can be treated independently,
however, and will not be discussed further, in order to keep the focus on learning measurement
models.

This  paper  is  organized  as  follows: 

•  Section  2:  Related  work  is  a  brief  overview  of  other  approaches  directly  or  indirectly 
related  to  the  task  of  building  a  measurement  model  from  data; 

• Section 3: Problem statement and assumptions formally defines the problem and
introduces which assumptions are considered in order to provide a rigorous interpretation
of our models. Such assumptions will be essential when proving the consistency
of our procedure;

• Section 4: Learning measurement models is the main section, describing the standard
algorithm for learning a representation of a set of measurement models consistent
with the data. This section considers the learning problem assuming the population
joint distribution is known. Later sections will treat the problem of learning with finite
samples;

• Section 5: Purification and identifiability describes a specific class of measurement
models that in practical applications will be the representation of choice due to
theoretical and practical reasons;

• Section 6: Statistical learning and practical implementations details how to
use the given algorithms when the population joint density is not known, which
heuristics can be used to improve robustness to sample variability, and how to
deal with the computational complexity of this procedure;

• Section 7: Empirical results discusses a series of experiments with simulated data
and three real-world data sets, along with criteria of success;

•  Section  8:  Conclusion  wraps  up  the  contributions  of  this  paper. 


2  Related  work 

Arguably,  the  most  traditional  framework  for  discovering  latent  variables  is  through  factor 
analysis (see, e.g., Johnson and Wichern, 2002). A number of factors is chosen based on
some  criterion  such  as  the  minimum  number  of  factors  that  fit  the  data  at  a  given  level  or 
the  number  that  maximizes  a  score  such  as  BIC.  After  fitting  the  data,  usually  assuming  a 
Gaussian  distribution,  different  transformations  to  the  latent  covariance  matrix  are  applied 
in  order  to  satisfy  some  criteria  of  simplicity.  Latents  are  interpreted  based  on  the  magnitude 
of  the  loadings  (the  coefficients  relating  each  observed  variable  to  each  latent). 

This method can be quite unsatisfactory due to the indeterminacy of the solution in
the Gaussian case. Rotation methods used to transform the latent covariance matrix have
no formal justification. For non-Gaussian cases, variations such as independent component
analysis and independent factor analysis (Attias, 1999) or tree-based component analysis
(Bach and Jordan, 2003) contribute little to solving the problem of learning measurement
models: by severely constraining latent relationships through marginal independencies or at
most pairwise dependencies, their goal is to obtain good joint density estimation or to perform
blind source separation, not model interpretation or latent concept discovery.

In contrast, Zhang (2004) does provide a sound representation for measurement models
for discrete observed and latent variables with a multinomial probabilistic model. The model
is constrained to be a tree, and every observed variable has one and only one (latent) parent and
no children. Therefore no observed variable can be a child of another observed variable or a
parent of a latent. To some extent, an equivalence class of graphs is described, which limits
the number of latents and the possible number of states each categorical latent variable can
have without being empirically indistinguishable from another graph with fewer latents or
fewer states per latent. Under these assumptions, the set of possible latent variable models
is therefore finite. Such a model is useful for modeling joint probability distributions, and Zhang
also points out that it can be used for cluster analysis, generalizing standard one-latent
approaches for clustering such as AutoClass (Cheeseman and Stutz, 1996). However, as
pointed out by Zhang, this choice of representation does not guarantee that every joint
distribution can be modeled well.

A  related  approach  is  given  by  Elidan  et  al.  (2000)  where  latent  variables  are  introduced 
into dense regions of a DAG learned through standard algorithms. Once one latent is
introduced as the parent of a set of nodes originally strongly connected, the standard search is
executed again and the process is iterated. They provide several results where this procedure
is effectively able to increase the fit over a latent-free graphical model, but little is discussed
about how to interpret the output. No equivalence classes are given, and all examples
described in Elidan et al. (2000) and Elidan and Friedman (2001), comparing an estimated
structure  against  a  true  model  structure  known  by  simulation,  use  as  starting  points  graphs 
that  are  very  close  to  the  true  graph.  The  main  problem  of  using  this  approach  for  model 
interpretation  and  causal  analysis  is  the  lack  of  a  description  of  which  graphs  are  empirically 
indistinguishable. 

Silva et al. (2003) provide the foundations of the work described here. In the next
sections,  we  discuss  how  we  generalize  the  previous  approach  and  which  new  heuristics  are 
applied.  The  present  work  itself  is  inspired  by  the  approaches  introduced  in  Glymour  et  al. 
(1987),  where  measurement  models  are  modified  based  on  an  initial  model  where  all  latents 
are  given,  not  discovered.  More  discussion  about  related  work  is  also  given  in  Silva  et  al. 
(2003). 

3  Problem  statement  and  assumptions 

The  goal  of  learning  measurement  models  is  identifying  unmeasured  concepts  (“factors”) 
that  causally  explain  the  associations  measured  over  a  set  of  observable  random  variables. 
The  framework  of  causal  graphical  models  will  be  used  as  a  formal  language  to  describe  our 
approach. Concepts such as graphs, paths, causal graphs, d-separation and the causal Markov
condition will be used. Unless otherwise specified, references such as Pearl (1988, 2000) and
Spirtes et al. (2000) contain all the necessary definitions in full detail.

The  following  definitions  introduce  the  families  of  measurement  models  and  graphical 
models of interest. We begin with our particular definition of a latent variable graph.

Definition 1 (Latent variable graph) A latent variable graph G is a graph with the
following characteristics:

1.  there  are  two  types  of  nodes:  observed  and  latent; 

2.  no  observed  node  is  an  ancestor  of  any  latent  node; 

3. each observed node is a child of (measures) at least one latent node;

4. there are no cycles involving an observed variable.
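
The conditions of Definition 1 can be checked mechanically on a small graph. The following is a minimal sketch, not part of the procedure described in this report: the function name `is_latent_variable_graph` and the edge-list representation (pairs `(parent, child)`) are assumptions made for illustration, and condition 1 is taken as given by passing the observed and latent node sets separately.

```python
from collections import defaultdict

def is_latent_variable_graph(edges, observed, latent):
    """Check conditions 2-4 of Definition 1 for a directed graph.

    edges: iterable of (parent, child) pairs; observed, latent: disjoint sets.
    """
    children, parents = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)

    def descendants(node):
        # Nodes reachable from `node` along a directed path of length >= 1.
        seen, stack = set(), list(children[node])
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(children[cur])
        return seen

    for o in observed:
        desc = descendants(o)
        if desc & latent:              # condition 2: o is an ancestor of a latent
            return False
        if not parents[o] & latent:    # condition 3: o has no latent parent
            return False
        if o in desc:                  # condition 4: o lies on a directed cycle
            return False
    return True

# Hypothetical example: two latents, three indicators, one observed-to-observed edge.
latent = {"L1", "L2"}
observed = {"X1", "X2", "X3"}
edges = [("L1", "L2"), ("L1", "X1"), ("L1", "X2"), ("L2", "X3"), ("X1", "X2")]
valid = is_latent_variable_graph(edges, observed, latent)
# Adding an edge from an indicator into a latent violates condition 2.
invalid = is_latent_variable_graph(edges + [("X3", "L1")], observed, latent)
```

Note that cycles among latent nodes are deliberately not rejected, since the definition only excludes cycles that involve an observed variable.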

The notation G(O) will sometimes be used to denote a latent variable graph G with
a set of observed variables O. The second assumption, which we call the measurement
assumption, cannot in general be tested empirically. Nevertheless, its use is justified in
several applications (e.g., Bartholomew et al., 2002). It is also the core assumption of all
procedures with goals similar to factor analysis, even when it is not made explicit. See Silva
et al. (2003) for more on this topic. Just as importantly, it partitions the graph into two main
parts, one of them composed of latent variables only. We can exploit this modularization
when defining a parameterization of the latent variable graph in order to avoid making
unnecessary assumptions about the causal structure of the unobserved variables.

In  the  next  section,  we  define  which  types  of  models  our  latent  variable  graphs  can 
represent. We then introduce a particularly useful equivalence class of models and formally
state  the  problem  of  learning  measurement  models  under  the  given  setup. 

3.1  Interpretation  and  parameterization 

We  assume  that  a  latent  variable  graph  G  is  quantitatively  instantiated  as  a  semi-parametric 
model  with  the  following  properties: 

1. G satisfies the Causal Markov condition (Spirtes et al., 2000; Pearl, 2000);

2.  each  observed  node  is  a  linear  function  of  its  parents  plus  an  additive  error  term  of 
positive  finite  variance  which  is  independent  of  every  other  error  term; 

3.  the  marginal  distribution  over  latent  variables  has  finite  second  moments,  positive 
variances  and  all  correlations  in  the  open  interval  (-1,  1). 

We call such an object a semilinear latent variable model. If the relationships among
the latent variables are also linear, that is, if each latent variable is a linear function of its
parents plus additive noise, then we call it a linear latent variable model, an instance of a
structural equation model (Bollen, 1989). For simplicity, we will assume that all variables
have zero mean. Unless otherwise specified, all latent variable models that we refer to in
this report are semilinear models. Sometimes we will call the graph of a semilinear model
a semilinear latent variable graph. A linear latent variable graph is defined analogously.
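
To make the parameterization concrete, consider the special case in which no observed variable has an observed parent and the error terms are independent of the latents (extra assumptions made only for this sketch). Each observed variable is then a linear combination of latents plus its own error, and the covariance matrix of the observed variables depends on the latents only through their covariance matrix, whatever the structure among them. The names `loadings`, `latent_cov` and `error_var` are hypothetical and the numbers are arbitrary.

```python
def implied_covariance(loadings, latent_cov, error_var):
    """Covariance of O = loadings * L + eps, with mutually independent errors eps.

    loadings: n x k list of lists (row i holds the coefficients of observed i),
    latent_cov: k x k covariance matrix of the latents,
    error_var: length-n list of positive error variances.
    """
    n, k = len(loadings), len(latent_cov)
    sigma = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sigma[i][j] = sum(
                loadings[i][a] * latent_cov[a][b] * loadings[j][b]
                for a in range(k) for b in range(k)
            )
        sigma[i][i] += error_var[i]   # independent errors only affect the diagonal
    return sigma

# Two correlated latents, each measured by two indicators.
loadings = [[1.0, 0.0],
            [0.8, 0.0],
            [0.0, 1.0],
            [0.0, 0.5]]
latent_cov = [[1.0, 0.3],
              [0.3, 2.0]]
error_var = [0.5, 0.4, 0.3, 0.2]
sigma = implied_covariance(loadings, latent_cov, error_var)
```

Only second moments of the latent distribution enter this computation, which is why the semilinear model needs no assumption on how the latents are generated.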

The linearity assumption linking an observed variable to its parents is one
way of constraining the classes of models represented by latent variable graphs. We call this
assumption linearity of measurement. With arbitrary functional relationships among children
and parents and arbitrary structure, any data can fit some latent variable model (Suppes
and Zanotti, 1981). For instance, to be able to introduce a useful, constrained latent
variable model, Zhang (2004) assumes that the latent variable graph has a tree structure
and that variables are discrete. He does not assume linearity of measurement.

However, our work concerns graphical causal modeling: representing causal processes as
directed graphs. Assuming the true (and unknown) processes in nature that generate our
data to have the graphical structure of a tree is not very interesting in most cases, considering
the bulk of applications of latent variable models in many sciences such as econometrics,
social sciences and psychometrics (e.g., Bollen, 1989), all of which share many points in
common with AI modeling. We prefer to allow the graphical structure over the latent nodes
to be entirely unrestricted: an arbitrary DAG, a DAG with other hidden common causes, a
cyclic graph, etc. Linearity of measurement might seem restrictive, but it is often explicitly
designed into econometric, psychometric, and social scientific studies. Although the linearity
of measurement assumption is not sufficient to guarantee full identifiability of a graph, as we
will see later, it is still useful for distinguishing a variety of features that only some graphs can
share for a given distribution.

Notice also that requiring linear direct effects from latents into observed variables can
be interpreted just as a change of latent space. For instance, suppose we have the graphical
model depicted in Figure 1(a), in which for simplicity we do not consider error terms.
Variable $\eta$ has a linear direct effect on three variables, and a nonlinear effect on the remaining
three. The same model can be represented as in Figure 1(b), where the latent space is split
into two latent variables with a linear measurement model [1]. The process $\eta_1 \rightarrow \eta_2$ is a
variable equivalent to $\eta$ and the fact that we can break it down into two simpler hidden common
causes might actually improve the interpretation of the model. The assumption about linear
latent effects on observed variables is therefore weaker than it might seem in principle: it
is basically a way of defining which latent variables can be considered direct causes of the
observed variables.

Given  the  definition  of  latent  variable  model,  we  can  now  introduce  the  following  key 
definition: 

Definition 2 (Measurement model) Let G(O) be a latent variable model. The submodel
containing  the  complete  set  of  nodes  of  G,  and  all  and  only  those  edges  that  point  into  O,  is 
called  the  measurement  model  of  G. 

Under this context, observed variables are also called indicators. Therefore, the graphical
representation of the measurement model of a latent variable model is just its subgraph when
we remove all edges that might exist among latent variables. Notice that graphically the
measurement model is a DAG. Also, a DAG submodel containing a subset of the observed
nodes of G, their latent parents, and all and only those edges that point into these observed
nodes, is called a measurement submodel. In an abuse of notation, sometimes we refer to the
graphical representation of the measurement model (i.e., the measurement model graph) as
simply “measurement model”.

[1] Notice however that we might not be able to define an intervention for variables $\{Y_1, Y_2, Y_3\}$ that does
not affect variables $\{X_1, X_2, X_3\}$. In our causal framework, it means that latent $\eta_2$ is a variable that cannot
be manipulated.

Figure 1: (a) A single latent explains the association of random variables $\{X_i, Y_i\}$ through
non-linear direct effects (for simplicity, assuming no random noise is added to indicators).
(b) This variable can be written as two different events associated by a directed edge, where
now all direct effects on the indicators are linear.

The remaining edges of G, along with the respective latent nodes, form the complementary
structure defined below:

Definition 3 (Structural model) Let G(O) be a latent variable model. The submodel
containing  only  the  latents  of  G,  and  all  and  only  the  edges  between  latents,  is  called  the 
structural  model  of  G. 

Therefore,  the  union  of  a  measurement  model  and  a  structural  model  with  the  same  set  of 
latents  forms  a  latent  variable  model.  As  hinted  before,  we  will  not  discuss  here  how  to  learn 
a  structural  model.  Still,  these  two  tasks  are  related  according  to  this  loose  formulation  of 
our  discovery  problem:  assuming  that  the  true  model  is  a  latent  variable  model,  given  a  data 
set  with  variables  O,  find  the  set  of  measurement  models  over  O  that  are  indistinguishable 
under  a  certain  class  of  constraints  on  the  observed  marginal,  and  that  will  facilitate  finding 
the Markov equivalence class, or the Partial Ancestral Graph of the structural model (Spirtes
et al., 2000).
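
Definitions 2 and 3 amount to a partition of the edges of G. A minimal sketch, assuming the graph is stored as a list of `(parent, child)` pairs and that it already satisfies Definition 1 (the function name `split_latent_variable_graph` is hypothetical):

```python
def split_latent_variable_graph(edges, observed):
    """Return (measurement_edges, structural_edges) of a latent variable graph.

    Measurement model: all and only the edges that point into an observed node.
    Structural model: all and only the edges between two latent nodes.
    """
    measurement = [(s, d) for s, d in edges if d in observed]
    structural = [(s, d) for s, d in edges
                  if s not in observed and d not in observed]
    return measurement, structural

# Hypothetical graph: a chain of three latents, one indicator each,
# plus one direct edge between indicators.
edges = [("L1", "L2"), ("L2", "L3"),
         ("L1", "X1"), ("L2", "X2"), ("L3", "X3"),
         ("X1", "X2")]
observed = {"X1", "X2", "X3"}
measurement, structural = split_latent_variable_graph(edges, observed)
```

Because no observed node may be an ancestor of a latent node, every edge falls into exactly one of the two lists, so their union recovers the original graph.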

Later  in  this  report,  we  briefly  describe  which  other  assumptions  could  be  used  to  support 
discrete  variables.  However,  we  do  need  two  extra  assumptions  for  any  result  in  this  report 
that  requires  the  true  model  G  to  be  a  linear  measurement  model  instead  of  the  more  general 
semilinear model:

1. G satisfies the Faithfulness assumption (Spirtes et al., 2000), called stability in Pearl
(2000). That is, a conditional independence is entailed in G by the causal Markov
condition if and only if it holds in the probability distribution over the variables
represented in the graph;

2. G is acyclic.

3.2  The  tetrad  equivalence  class 

In  order  to  be  able  to  distinguish  among  different  measurement  models  that  might  have 
generated  our  observed  joint  probability  distribution,  we  need  to  report  those  models  that 
are  compatible  with  observed  constraints  of  the  joint.  A  measurement  model  is  compatible 
if  it  entails  only  observed  constraints: 

Definition 4 (Constraint entailment) A latent variable graph G entails a constraint if
and only if the constraint holds in every distribution parameterized by the pair $(P_G, \Theta)$, where
$P_G$ is a probability distribution over the latent variables that satisfies the Markov condition
for the structural model in G, and $\Theta$ is the set of linear coefficients and error variances for
the observed variables. The measurement model of G entails a constraint if and only if the
constraint holds in every distribution parameterized by the pair $(P, \Theta)$, where $P$ is any
probability distribution among the latents. ($P_G$ and $P$ also have to satisfy the assumptions
on latent variable models about first and second moments.)

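We are interested in a specific class of constraints. Given the covariance matrix of four
random variables {A, B, C, D}, we have that zero, one or three of the following constraints
may hold:

$\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD}$
$\sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$
$\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$

where $\sigma_{XY}$ represents the covariance of X and Y. We refer to these as tetrad constraints.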
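
Given a population covariance matrix, the three constraints can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, variables are identified by their position in the matrix, and a fixed numerical tolerance stands in for the statistical tests that finite samples require. Writing each constraint as a difference that should vanish also shows why exactly two of them cannot hold: the third difference is the sum of the first two.

```python
def tetrad_differences(cov, a, b, c, d):
    """Return the three tetrad differences for the variables indexed a, b, c, d."""
    t1 = cov[a][b] * cov[c][d] - cov[a][c] * cov[b][d]
    t2 = cov[a][c] * cov[b][d] - cov[a][d] * cov[b][c]
    t3 = cov[a][b] * cov[c][d] - cov[a][d] * cov[b][c]   # equals t1 + t2
    return t1, t2, t3

def count_vanishing(diffs, tol=1e-9):
    """Number of tetrad differences that are zero up to the tolerance."""
    return sum(abs(t) <= tol for t in diffs)

# One latent with unit variance and loadings 1, 2, 3, 4 (error variance 1 each):
# off-diagonal covariances are products of loadings.
one_factor = [[2, 2, 3, 4],
              [2, 5, 6, 8],
              [3, 6, 10, 12],
              [4, 8, 12, 17]]

# Two latents with covariance 0.5, unit variances and unit loadings;
# the first two indicators measure one latent, the last two the other.
two_factor = [[2.0, 1.0, 0.5, 0.5],
              [1.0, 2.0, 0.5, 0.5],
              [0.5, 0.5, 2.0, 1.0],
              [0.5, 0.5, 1.0, 2.0]]

n_one = count_vanishing(tetrad_differences(one_factor, 0, 1, 2, 3))
n_two = count_vanishing(tetrad_differences(two_factor, 0, 1, 2, 3))
```

In this example all three constraints hold for `one_factor`, while only the second holds for `two_factor`.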

Like  conditional  independence  constraints,  different  latent  variable  graphs  might  entail 
different  tetrad  constraints.  Therefore,  a  given  set  of  observed  tetrad  constraints  will  restrict 
the  set  of  possible  latent  variable  graphs  that  are  compatible  with  the  data.  We  restrict 
our  algorithm  to  search  for  measurement  models  that  entail  the  observed  tetrad  constraints 
and  vanishing  partial  correlations  judged  to  hold  in  the  population.  Since  these  constraints 
ignore  any  information  concerning  the  joint  distribution  besides  its  second  moments,  this 
might  seem  an  unnecessary  limitation.  What  can  be  learned  from  these  constraints  can  be 
substantial,  however,  and  attending  to  only  the  lower  order  moments  makes  the  algorithm  less 
prone  to  statistical  errors.  The  empirical  results  discussed  in  Section  7  support  this  tradeoff. 
Assuming  that  the  correct  model  entails  all  such  constraints  in  the  marginal  probability 
distribution  is  a  restricted  version  of  the  Faithfulness  assumption  discussed  in  (Spirtes  et  al., 
2000). 
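
Vanishing partial correlations are the other constraints used alongside tetrads, and they also depend on second moments only. As a hedged illustration (the population covariance matrix is assumed known and the function name is hypothetical), the first-order partial correlation of X and Y given Z can be computed as follows. Its numerator vanishes exactly when $\sigma_{ZZ}\sigma_{XY} = \sigma_{XZ}\sigma_{YZ}$.

```python
import math

def partial_correlation(cov, x, y, z):
    """First-order partial correlation of variables x and y given z.

    cov is a covariance matrix (list of lists); x, y, z are indices into it.
    """
    def corr(i, j):
        return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Z has unit variance, X = 2Z + e1 and Y = 3Z + e2 with unit-variance
# independent errors, so X and Y are uncorrelated once Z is held fixed.
# Variable order in the matrix: X, Y, Z.
common_cause = [[5.0, 6.0, 2.0],
                [6.0, 10.0, 3.0],
                [2.0, 3.0, 1.0]]
rho = partial_correlation(common_cause, 0, 1, 2)
```

With sample covariances the result will rarely be exactly zero, so in practice a statistical test has to decide whether the constraint is judged to hold.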

In  the  particular  case  of  linear  models,  tetrad  constraints  have  a  well-defined  graphical 
implication.  First,  we  need  to  introduce  a  few  more  definitions: 


•  in  a  graphical  model,  a  collider  on  a  path  is  a  pair  of  consecutive  directed  edges  on 
this  path  such  that  both  edges  point  to  the  same  node; 

•  a  trek  between  a  pair  of  nodes  X  and  Y  is  an  (undirected)  path  that  does  not  contain 
any  collider; 

• a choke point for two sets of nodes X and Y is a node that lies on every trek between
an element of X and an element of Y [2].
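
These three definitions can be made concrete on small graphs by brute-force enumeration. The sketch below is an illustration under stated assumptions: it uses the simple notion of choke point given in the list above, represents a graph as `(parent, child)` pairs, and relies on the fact that a collider-free path consists of two directed paths that leave a common source and share no other node. All function names are hypothetical, and the enumeration is exponential in general.

```python
from collections import defaultdict

def directed_paths(children, source, target, path=None):
    """Yield every directed path (no repeated node) from source to target."""
    path = [source] if path is None else path
    if source == target:
        yield list(path)
        return
    for child in children[source]:
        if child not in path:
            path.append(child)
            yield from directed_paths(children, child, target, path)
            path.pop()

def treks(edges, x, y):
    """Yield the node set of every trek between x and y."""
    children = defaultdict(list)
    nodes = set()
    for s, d in edges:
        children[s].append(d)
        nodes.update((s, d))
    for source in nodes:
        for left in directed_paths(children, source, x):
            for right in directed_paths(children, source, y):
                # The two directed paths may share only their common source.
                if set(left) & set(right) == {source}:
                    yield set(left) | set(right)

def choke_points(edges, xs, ys):
    """Nodes lying on every trek between an element of xs and an element of ys.

    Returns an empty set when no trek exists (a convention of this sketch).
    """
    common = None
    for x in xs:
        for y in ys:
            for trek in treks(edges, x, y):
                common = trek if common is None else common & trek
    return set() if common is None else common

# One latent with four indicators: the latent is the only choke point.
one_factor = [("L", "X1"), ("L", "X2"), ("L", "Y1"), ("L", "Y2")]
cp_one = choke_points(one_factor, ["X1", "X2"], ["Y1", "Y2"])

# Two connected latents, each with two indicators: both latents are choke points.
two_factor = [("L1", "L2"), ("L1", "X1"), ("L1", "X2"), ("L2", "Y1"), ("L2", "Y2")]
cp_two = choke_points(two_factor, ["X1", "X2"], ["Y1", "Y2"])
```

A directed path from one node to the other counts as a trek here, since it contains no collider; this corresponds to the case where the source is one of the two endpoints.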

A  graphical  characterization  of  tetrad  constraints  for  linear  graphs  is  known  under  the 
Faithfulness  assumption: 

Theorem 1 (The Tetrad Representation Theorem) Let G be a linear graph, and let
$I_1, I_2, J_1, J_2$ be four variables in G. Then $\sigma_{I_1 J_1}\sigma_{I_2 J_2} = \sigma_{I_1 J_2}\sigma_{I_2 J_1}$ if and only if there is a
choke point between $\{I_1, I_2\}$ and $\{J_1, J_2\}$.

Proof:  See  Shafer  et  al.  (1993)  and  Spirtes  et  al.  (2000).  □ 
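
As a worked instance of the theorem, consider the simplest case, sketched here under assumptions that are ours rather than part of the statement: a single latent $L$ with variance $\sigma_L^2$ and four indicators, each a linear function of $L$ plus an independent error. The coefficients $\lambda_i$ and errors $\epsilon_i$ are notation introduced only for this illustration.

```latex
\begin{align*}
X_i &= \lambda_i L + \epsilon_i, \qquad i = 1, \dots, 4, \\
\sigma_{X_i X_j} &= \lambda_i \lambda_j \sigma_L^2 \qquad (i \neq j), \\
\sigma_{X_1 X_2}\,\sigma_{X_3 X_4}
  &= \lambda_1 \lambda_2 \lambda_3 \lambda_4\, \sigma_L^4
   = \sigma_{X_1 X_3}\,\sigma_{X_2 X_4}
   = \sigma_{X_1 X_4}\,\sigma_{X_2 X_3}.
\end{align*}
```

Here $L$ lies on every trek between any two indicators, so it is a choke point for each of the three ways of splitting the four indicators into two pairs, in agreement with all three tetrad constraints holding.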

One  can  see  how  tetrad  constraints  are  useful  for  learning  the  structure  of  latent  variable 
graphs  in  the  linear  case:  for  instance,  if  one  is  given  a  linear  latent  variable  graph  as  a 
starting  point,  this  graph  will  entail  several  tetrad  constraints  that  may  hold  or  not  among 
observed  variables,  and  various  modifications  can  be  suggested  to  the  current  structure  in 
order  to  make  it  entail  more  of  the  tetrad  constraints  that  hold  in  the  probability  distribution 
and fewer of the constraints that do not hold. This is explored in Glymour et al. (1987) and
Spirtes  et  al.  (2000). 

In  this  work,  we  explore  principled  approaches  to  reconstruct  several  features  of  the 
graphical  structure  of  an  unknown  measurement  model  based  on  the  covariance  matrix  of 
the  observed  variables,  where  no  starting  graph  is  required  and  the  true  model  can  be 
semilinear.  It  is  an  extension  of  the  work  of  Silva  et  al.  (2003)  with  relaxed  assumptions. 
The  principle  continues  to  be  matching  entailed  tetrad  constraints  to  observed  ones. 

However, since there is no known graphical criterion of tetrad entailment for arbitrary
semilinear latent variable graphs (or even for vanishing partial correlations, which will also
be useful) comparable to the d-separation calculus for conditional independencies, we have to rely
on Definition 4, which is not purely graphical. It is basically a criterion of invariance with
respect to the parameters of the measurement model. Invariance with respect to parameters
is the key property of what is sometimes called a “structural” constraint (e.g., as in Shafer
et al., 1993), and we claim nothing is lost in causal analysis by defining entailment in a causal
graph where the causal features that are not of immediate interest are not parameterized (in
our case, the structural model). We will show several results that hold only with probability
1 with respect to a Lebesgue measure taken over $\Theta$, the linear coefficients and error variances
in such graphs, but in practice this is no stronger than assuming the Faithfulness condition,
which is known to fail only for a set of parameters that has measure zero (Spirtes et al., 2000) for
linear models.

[2] This is actually the definition of weak choke point as explained in Shafer, Kogan and Spirtes (1993), but
it  will  suffice  for  our  exposition.  Since  we  do  not  make  use  of  the  definition  of  choke  point  except  in  some 
proofs  in  the  appendix,  and  such  definition  requires  a  more  detailed  explanation,  we  defer  the  presentation 
of  the  full  definition  to  the  appendix  to  avoid  interrupting  the  flow  of  the  text. 


Figure 2: In this model, different choices of covariance matrix for latents $\{L_1, L_2, L_3\}$ will
make three or only one tetrad constraint for variables $\{X_1, X_2, Y_1, Y_2\}$ hold.


In order to understand the difference between entailment by a latent variable graph
and entailment by its respective measurement model, one can look at the example given in
Figure 2. Latent $L_2$ is a choke point for $(X_1, X_2) \times (Y_1, Y_2)$ and will imply the tetrad constraint
$\sigma_{X_1 Y_1}\sigma_{X_2 Y_2} = \sigma_{X_1 Y_2}\sigma_{X_2 Y_1}$ independently of the model being linear or semilinear. However,
the other possible tetrad $\sigma_{X_1 X_2}\sigma_{Y_1 Y_2} = \sigma_{X_1 Y_1}\sigma_{X_2 Y_2}$ will hold if and only if
$\sigma^2_{L_1}\sigma_{L_2 L_3} = \sigma_{L_1 L_2}\sigma_{L_1 L_3}$,
i.e., the partial correlation of $L_2$ and $L_3$ conditioned on $L_1$ being zero, which is true for all
probability distributions that are Markov relative to the latents in this graph, but not for an
arbitrary latent covariance matrix. Therefore, this particular tetrad is not entailed by the
measurement model. We need to distinguish between the two forms of entailment because
we want to learn about measurement models independently of the possible structural model
of the true latent graph. They will therefore form equivalence classes.

Definition 5 (Tetrad equivalence class) A tetrad equivalence class T(C) is a set of
latent variable graphs T all of whose measurement models entail the same set of tetrad
constraints and vanishing partial correlations C among the measured variables. An equivalence
class of measurement models M(C) for C is the union of the measurement models in T(C).

To summarize, we assume that the true model is a latent graphical model with the
properties described in this section. Under this condition, several results will be proved in the
next sections. The goal is not to identify the exact true measurement model, because in
general our assumptions are still not strong enough for such a task. The general problem can then be
reformulated as follows: assuming the true model is a latent variable model, given a data
set with variables O, return all possible measurement models over O that are indistinguishable
under the class of tetrad constraints and vanishing partial correlations on the observed
marginal. We will show this is possible to some extent.

An interesting question is whether it makes a difference to assume the true graph is linear instead
of semilinear, i.e., whether, for some set C of tetrad and vanishing partial correlation constraints and
a fixed latent probability distribution $P_G$ faithful to a linear model, the set of possible linear
latent models conditioned on $P_G$ that entail C is strictly smaller than the set of semilinear
models. The answer to this question and the reason we are interested in results for a fixed
marginal distribution for the latents will be discussed in Section 4.2.

4 Learning measurement models

In  this  section,  we  introduce  different  criteria  for  learning  features  common  to  all  possible 
latent  variable  graphs  that  generated  the  observed  tetrad  constraints  and  vanishing  partial 
correlations. Sections 4.1, 4.2 and 4.3 describe which constraints are used and which structural
features can be discovered. Section 4.4 will introduce an algorithm that uses those
constraints  to  output  measurement  models  compatible  with  the  observed  covariance  matrix. 

4.1  Locally  sound  constraint  sets 

There is a specific class of sets of probabilistic constraints of practical interest which we will
call locally sound constraint sets. A locally sound constraint set is a collection of
constraints on the joint distribution of k observed variables, where k is a constant that does
not grow with the total number of given variables. The variables used in the constraint set
are called the domain of the constraint set. When such a constraint set holds, then it should
be sound (as defined in the next paragraph) to infer some particular feature of the unknown
graph of interest. For instance, algorithms such as PC Search (Spirtes et al., 2000) and
GES (Meek, 1997) test constraints that can refer to up to all variables in the domain, and
therefore cannot be considered “local” in the sense given here. However, anytime variations
of the same idea, such as the Anytime FCI algorithm of Spirtes (2000), fix the
largest number of variables on which tests of conditional independence are evaluated, and
therefore such constraints can be considered locally sound constraints in that context.

Let the true latent variable model G be parameterized by a pair $(P_G, \Theta)$, where $P_G$
is a joint distribution over its latents that is Markov relative to G and $\Theta$ is the set of
coefficients and error variances for the respective measurement model. We define soundness
of a constraint set in the context of latent variable models as follows: if a constraint set
establishes that a certain feature should hold in G, then the probability of failure is zero with
respect to a Lebesgue measure over $\Theta$. That is, for some set of values of $\Theta$, an inference rule
using the constraint set is allowed to fail, as long as this set has measure zero. However, this
constraint should hold for every $P_G$. The reason is that we do not know how to quantify whether the
set of distributions $P_G$ in which the constraint set rule erroneously applies is a “small” set
in some measurable sense, since $P_G$ might be a result of the Markov condition applied to the
unknown functional relationships among latents in G, and we do not make assumptions on
how the parameterization of such functions is done. As discussed before, allowing a chance
of error with probability zero is not stronger than assuming the Faithfulness condition in,
say, linear DAGs.

The  computational  cost  is  not  the  only  attractive  feature  of  locally  sound  constraint  sets: 
it  is  a  reasonable  idea  not  to  rely  on  constraints  with  a  large  number  of  variables  because 
statistical  decisions  are  less  reliable.  The  theoretical  results  should  then  be  constructed 
with  this  self-imposed  limitation  in  mind,  making  the  theory  more  relevant  for  practical 
applications. 

There  are  two  main  structural  features  of  measurement  models  that  can  be  discovered 
by  our  method: 

• instances where two given observed variables cannot have a common parent in any
latent variable graph entailing the observed tetrad constraints and vanishing partial
correlations (Section 4.2);

• instances where a given observed variable cannot be an ancestor of another given
observed variable in any latent variable graph entailing the observed tetrad constraints
and vanishing partial correlations (Section 4.3).

For the first situation, we will make use of constraint sets with k = 6. The reasons are
simple: first, because of their practical use, as illustrated in the empirical examples described
later in this report. Second, because they are the simplest constraint sets that can be used,
as given by the following result:

Theorem 3 There is no locally sound tetrad constraint set of domain size less than 6 for
deciding if two nodes A and B do not have a common parent in a latent variable graph G, if
$\rho_{X_1X_2.X_3} \neq 0$ and $\rho_{X_1X_2} \neq 0$ for all $\{X_1, X_2\}$ in the domain of the constraint set and observed
variable $X_3$.

All  of  our  non-trivial  constraint  sets  require  partial  correlations  to  be  nonzero,  and  it  can 
be  argued  that  there  might  be  combinations  of  vanishing  partial  correlations  and  vanishing 
tetrads  that  could  be  used  instead.  We  claim  this  combination  is  not  likely  to  be  useful. 
We  are  mostly  interested  in  tetrad  constraints  that  arise  because  of  some  latent  choke  point, 
and  if  such  node  exists,  then  no  correlations  and  partial  correlations  over  those  variables  will 
vanish. On the other hand, if the choke point is an observed variable, then we can use
the observed vanishing partial correlations directly to infer that some nodes cannot share a parent
without using tetrad constraints.
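The case of an observed choke point can be illustrated with the standard first-order partial correlation formula, $\rho_{XY.Z} = (\rho_{XY} - \rho_{XZ}\rho_{YZ}) / \sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}$. The sketch below is only an illustration: the correlation values are assumed, and the function name `partial_corr` is hypothetical.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation rho_{XY.Z} from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Assumed correlations for X <- Z -> Y with Z observed: rho_XY = rho_XZ * rho_YZ,
# so conditioning on the observed choke point Z removes the association.
r_xz, r_yz = 0.6, 0.5
observed_choke_point = partial_corr(r_xz * r_yz, r_xz, r_yz)

# Same rho_XZ and rho_YZ, but X and Y also share an unobserved cause that
# raises rho_XY: the partial correlation given Z no longer vanishes.
with_hidden_cause = partial_corr(0.45, r_xz, r_yz)

print(observed_choke_point, with_hidden_cause)
```

Only one variable appears in the conditioning set, which matches the restriction to first-order partial correlations used throughout this section.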

It is certainly possible to use vanishing partial correlations only in order to detect some
instances where two nodes cannot have a latent common parent: the FCI algorithm described
in Spirtes et al. (2000) does it even for some situations where pairs of variables are dependent
conditioned on any subset of the others. In a more restricted sense, conditional independencies
can be used to rule out hidden common causes among pairs of variables as suggested in
Heckerman (1998) (and tested empirically in a few cases by Elidan et al., 2000). However, in
this work we try to avoid conditional independencies as much as possible: their identification
in finite samples becomes less reliable as the size of the conditioning set grows, and there are
other theoretical issues on the reliability of conditional independence constraints in causal
analysis even when variables are strongly independent (Robins et al., 2003). This becomes
especially relevant in our case, which is biased toward models where all variables have hidden
common causes. In Section 4.4 we discuss the use of partial correlations in the context of
the full algorithm.

4.2  Constraints  for  non-overlapping  parent  sets 

In this section, we describe a series of constraints for deciding when two nodes cannot have
a common (latent) parent in a latent variable graph G(O). We start with a constraint set rule
(CS1) given as follows:

For variables $S = \{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$, if

$\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$, for all $\{A, B\} \subset S$, $C \in O$

$\sigma_{X_1Y_1}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{X_3Y_1} = \sigma_{X_1X_3}\sigma_{X_2Y_1}$
$\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$

$\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$

then $X_1$ and $Y_1$ cannot have a common parent in a latent variable model


The correctness of this rule is given by the following lemma:

Lemma 3 Let G(O) be a semilinear latent variable model. Assume $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$
and $\sigma_{X_1Y_1}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{X_3Y_1} = \sigma_{X_1X_3}\sigma_{X_2Y_1}$,
$\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$,
$\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$ and that for all triplets $\{A, B, C\}$, $\{A, B\} \subset \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$,
$C \in O$, we have $\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$. Then $X_1$ and $Y_1$ do not have a common parent in G
with probability 1 with respect to a Lebesgue measure over the coefficient and error variance
parameters.
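The tetrad premises of CS1 can be written down directly as a check on a population covariance matrix. The sketch below is an illustration under assumptions, not the implementation used in this report: the helper names `make_cov` and `cs1_tetrads`, the loadings and the tolerance `TOL` are hypothetical, and the premises on non-vanishing correlations and partial correlations are assumed to have been verified separately.

```python
from itertools import combinations

TOL = 1e-9  # numerical tolerance standing in for exact population equality


def make_cov(loadings, latent_cov):
    """Population covariances between distinct indicators of a pure linear model:
    sigma_ij = loading_i * loading_j * cov(parent_i, parent_j)."""
    return {
        frozenset((a, b)): la * lb * latent_cov[pa][pb]
        for (a, (pa, la)), (b, (pb, lb)) in combinations(loadings.items(), 2)
    }


def cs1_tetrads(cov, x1, x2, x3, y1, y2, y3):
    """True when the three tetrad premises of CS1 hold for the given labelling."""
    def s(a, b):
        return cov[frozenset((a, b))]

    first = (abs(s(x1, y1) * s(x2, x3) - s(x1, x2) * s(x3, y1)) < TOL
             and abs(s(x1, x2) * s(x3, y1) - s(x1, x3) * s(x2, y1)) < TOL)
    second = (abs(s(x1, y1) * s(y2, y3) - s(x1, y2) * s(y1, y3)) < TOL
              and abs(s(x1, y2) * s(y1, y3) - s(x1, y3) * s(y1, y2)) < TOL)
    third = abs(s(x1, x2) * s(y1, y2) - s(x1, y2) * s(x2, y1)) > TOL
    return first and second and third


NAMES = ("X1", "X2", "X3", "Y1", "Y2", "Y3")
# Assumed example: the X's load on latent 0 and the Y's on latent 1.
two_factor = {"X1": (0, 0.9), "X2": (0, 0.7), "X3": (0, 0.8),
              "Y1": (1, 0.6), "Y2": (1, 1.1), "Y3": (1, 0.5)}
# Same loadings, but every indicator is a child of the same latent.
one_factor = {name: (0, load) for name, (_, load) in two_factor.items()}
latent_cov = [[1.0, 0.4], [0.4, 1.0]]

fires_two_factor = cs1_tetrads(make_cov(two_factor, latent_cov), *NAMES)
fires_one_factor = cs1_tetrads(make_cov(one_factor, latent_cov), *NAMES)
print(fires_two_factor, fires_one_factor)
```

In the two-latent example the premises hold, so the rule would conclude that X1 and Y1 have no common parent; in the one-latent example the inequality fails, so the rule stays silent, as it should.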

The  proofs  for  this  lemma  and  for  many  other  results  in  this  report  are  given  in  the 
Appendix. 

Let the predicate $F_1(X, Y, G)$ be true if and only if there exist two observed nodes W and
Z in latent variable graph G such that $\sigma_{XY}\sigma_{WZ} = \sigma_{XW}\sigma_{YZ} = \sigma_{XZ}\sigma_{YW}$ holds, all variables
in $\{W, X, Y, Z\}$ are correlated, and there is no observed C in G such that $\rho_{AB.C} = 0$ for
$\{A, B\} \subset \{W, X, Y, Z\}$. A second constraint set rule, CS2, is as follows:


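For variables $S = \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, if

$X_1$ is not an ancestor of $X_3$ and $Y_1$ is not an ancestor of $Y_3$
$F_1(X_1, X_2, G) = true$ and $F_1(Y_1, Y_2, G) = true$
$\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$, for all $\{A, B\} \subset S$, $C \in O$

$\sigma_{X_1Y_1}\sigma_{X_2Y_2} = \sigma_{X_1Y_2}\sigma_{X_2Y_1}$
$\sigma_{X_2Y_1}\sigma_{Y_2Y_3} = \sigma_{X_2Y_3}\sigma_{Y_2Y_1}$
$\sigma_{X_1X_2}\sigma_{X_3Y_2} = \sigma_{X_1Y_2}\sigma_{X_3X_2}$

$\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$

then $X_1$ and $Y_1$ cannot have a common parent in a linear latent variable model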


The correctness of this rule is given by the following lemma:

Lemma 5 Let G(O) be a linear latent variable model. Assume $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$,
$X_1$ is not an ancestor of $X_3$, $Y_1$ is not an ancestor of $Y_3$, $F_1(X_1, X_2, G) = true$,
$F_1(Y_1, Y_2, G) = true$ and $\sigma_{X_1Y_1}\sigma_{X_2Y_2} = \sigma_{X_1Y_2}\sigma_{X_2Y_1}$,
$\sigma_{X_2Y_1}\sigma_{Y_2Y_3} = \sigma_{X_2Y_3}\sigma_{Y_2Y_1}$, $\sigma_{X_1X_2}\sigma_{X_3Y_2} = \sigma_{X_1Y_2}\sigma_{X_3X_2}$,
$\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$ and that for all triplets $\{A, B, C\}$,
$\{A, B\} \subset \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, $C \in O$, we have $\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$. Then $X_1$ and $Y_1$ do not have a common
parent in G.

In the next section we will show empirical ways of testing the premises of CS2 concerning
ancestral relations. A third constraint set rule, CS3, is as follows:


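For variables $S = \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, if

$\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$, for all $\{A, B\} \subset S$, $C \in O$

$\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$
$\sigma_{X_1Y_2}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_2X_3} = \sigma_{X_1X_3}\sigma_{X_2Y_2}$
$\sigma_{X_1Y_3}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_3X_3} = \sigma_{X_1X_3}\sigma_{X_2Y_3}$

$\sigma_{X_1X_2}\sigma_{Y_2Y_3} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_3}$

then $X_1$ and $Y_1$ cannot have a common parent in a linear latent variable model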


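The correctness of this rule is given by the following lemma:

Lemma 6 Let G(O) be a linear latent variable model. Assume $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$
and $\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$,
$\sigma_{X_1Y_2}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_2X_3} = \sigma_{X_1X_3}\sigma_{X_2Y_2}$,
$\sigma_{X_1Y_3}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_3X_3} = \sigma_{X_1X_3}\sigma_{X_2Y_3}$,
$\sigma_{X_1X_2}\sigma_{Y_2Y_3} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_3}$ and that for all triplets
$\{A, B, C\}$, $\{A, B\} \subset \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, $C \in O$, we have $\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$. Then $X_1$
and $Y_1$ do not have a common parent in G.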

CS3 has an important difference with respect to the others: one can show that the assumption
of full linearity is necessary.

Lemma 7 CS3 is not sound for semilinear latent variable models.

We are now able to give an answer to the question presented at the end of Section 3.2. Let $\Sigma$
be an observable covariance matrix, $LT(\Sigma)$ the set of all linear latent variable graphs
that entail all and only the tetrad and vanishing partial correlation constraints in $\Sigma$, and
$ST(\Sigma, \Sigma_L)$ the set of all semilinear latent variable graphs with latent covariance matrix $\Sigma_L$
that entail all and only the tetrad and vanishing partial correlation constraints in $\Sigma$. We say
that $G \in LT_m(\Sigma)$ if the measurement model of G is the measurement model of some graph
in $LT(\Sigma)$, and a similar definition describes $ST_m(\Sigma, \Sigma_L)$. We have the following theorem as
a direct result from the previous two lemmas:

Theorem 2 There is some $\Sigma_L$ such that $LT_m(\Sigma)$ and $ST_m(\Sigma, \Sigma_L)$ are not equal.

Therefore, we can gain more discriminative power if we assume that the true graph is a
linear latent variable graph in the class of tetrad constraints. However, we only know one
rule that is provably not valid for semilinear graphs, and it is the most constrained of all,
which makes the extra assumption of full linearity not particularly attractive. Still, it holds
for multivariate normal distributions, a very important practical case. More importantly from
the point of view of causal discovery, the known methods for learning a structural model
(Silva, 2002) require full linearity.

Before we move to the next section, it is interesting to state the following:

Proposition 1 CS1, CS2 and CS3 are logically independent.

In other words, the rules presented in this section are not redundant. Figure 3 depicts
three situations, in each of which only one of the rules can be applied.

Figure 3: Three examples, (a), (b) and (c), with two main latents and several independent latent common
causes of two indicators (represented by double-directed edges). In (a), CS1 applies, but not
CS2 nor CS3 (even when exchanging labels of the variables). In (b), CS2 applies (assuming
the necessary $F_1$ conditions hold), but not CS1 nor CS3. In (c), CS3 applies, but not CS1
nor CS2.

4.3  Discovering  other  features  of  latent  variable  graphs 

It  is  possible  in  many  cases  to  tell  if  an  observed  node  is  not  an  ancestor  of  another. 

Lemma 1 Let G(O) be a semilinear latent variable graph. For some set $O' = \{A, B, C, D\} \subset O$,
if $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ and for all triplets $\{X, Y, Z\}$, $\{X, Y\} \subset O'$, $Z \in O$,
we have $\rho_{XY.Z} \neq 0$ and $\rho_{XY} \neq 0$, then no element $X \in O'$ is an ancestor of any element
in $O' \setminus X$ in G with probability 1 with respect to a Lebesgue measure over the coefficient and
error variance parameters.
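A small numerical check can make the content of Lemma 1 concrete: in a one-factor linear model all three tetrad products of a quartet agree, and adding a direct edge between two indicators breaks the equality. This is a sketch under assumed parameters; `one_factor_cov`, `three_tetrads_hold`, the loadings, error variances and edge coefficient are all hypothetical choices.

```python
TOL = 1e-9  # numerical tolerance standing in for exact population equality


def three_tetrads_hold(s, a, b, c, d):
    """Check sigma_ab sigma_cd = sigma_ac sigma_bd = sigma_ad sigma_bc."""
    p1 = s[a][b] * s[c][d]
    p2 = s[a][c] * s[b][d]
    p3 = s[a][d] * s[b][c]
    return abs(p1 - p2) < TOL and abs(p2 - p3) < TOL


def one_factor_cov(loadings, error_vars, beta=0.0):
    """Covariance matrix of X_i = loading_i * L + e_i with Var(L) = 1,
    plus an optional direct edge X_0 -> X_1 with coefficient beta."""
    n = len(loadings)
    source_vars = [1.0] + list(error_vars)
    # Row i holds the coefficients of X_i on the independent sources (L, e_0, ..., e_{n-1}).
    coef = [[loadings[i]] + [1.0 if j == i else 0.0 for j in range(n)]
            for i in range(n)]
    coef[1] = [c1 + beta * c0 for c0, c1 in zip(coef[0], coef[1])]
    return [[sum(u * v * w for u, v, w in zip(coef[i], coef[j], source_vars))
             for j in range(n)] for i in range(n)]


loadings = [0.9, 0.7, 0.8, 0.6]
error_vars = [0.5, 0.4, 0.3, 0.6]

pure_ok = three_tetrads_hold(one_factor_cov(loadings, error_vars), 0, 1, 2, 3)
impure_ok = three_tetrads_hold(one_factor_cov(loadings, error_vars, beta=0.5), 0, 1, 2, 3)
print(pure_ok, impure_ok)
```

The second case is the contrapositive reading of the lemma: since X_0 is an ancestor of X_1 there, the three products cannot all coincide for generic parameter values.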

There  are  certainly  other  features  of  interest  in  a  measurement  model,  such  as  which 
nodes  do  have  a  common  parent,  how  many  parents  are  common,  and  if  a  node  is  a  parent  of 
another.  However,  tetrad  constraints  are  quite  limited  with  respect  to  these  other  features: 
going  back  to  the  linear  case  and  the  Tetrad  Representation  Theorem,  one  can  see  that  the 
lack  of  a  choke  point  can  be  explained  in  many  different  ways,  from  the  existence  of  multiple 
common  parents  to  even  the  fact  that  one  node  is  a  parent  of  another  observed  node.  There 
is  very  little  that  can  be  done  for  these  other  features  within  a  tetrad  equivalence  class,  but 
there  are  two  alternatives. 

The first one is to use tetrad constraints only to initialize a model by excluding common
parents and possible observed ancestors where we know they should not exist. Then,
proceed with a standard algorithm for learning Bayesian network structures. There are many
heuristic search algorithms that can work reasonably well in practice when the starting point
is close to the true graph (e.g., Elidan et al., 2000). However, no theoretical guarantees of
consistency are known.

A second alternative is to select a subset of variables where there are no other major
features to be discovered, i.e., for every pair of nodes we know if they share exactly one
parent or none, and observed nodes cannot be parents of other observed nodes. Under the
given constraints, we can cluster all variables into groups that share a single or no common
parent. This process is called purification and can be done entirely under a tetrad equivalence
class with theoretical guarantees. This alternative will be explored in detail in Section 5.

4.4  Algorithm 

We  now  use  the  information  that  can  be  obtained  by  tetrad  constraints  and  vanishing  partial 
correlations  in  a  learning  algorithm.  First,  one  has  to  notice  that  it  is  difficult  to  design  a 
principled  score-based  algorithm  for  learning  measurement  models  because  in  general  there  is 
no  known  notion  of  score  equivalence,  i.e.,  how  to  describe  which  structures  will  correspond 
to  the  same  score.  So  far,  we  do  not  have  a  characterization  of  which  measurement  models 
will  be  score  equivalent  for  any  kind  of  score  function  based  on  the  likelihood  or  posterior 
distribution of latent graphs. In this work, we will focus mainly on constraint-based search
algorithms that have the property of Fisher consistency: given infinite data, the output is
guaranteed to have specific properties.

Assume for now that the population covariance matrix $\Sigma$ is known. Let C be the set
of tetrad and vanishing partial correlation constraints in $\Sigma$, and M(C) the measurement
model equivalence class for C. We define a generalized measurement pattern, or GMP(C), to
be a graphical object representing features of the equivalence class M(C). The only edges
allowed in a GMP are directed edges from latents to observed nodes, and undirected edges
between observed nodes. Every observed node in a GMP has at least one latent parent. If
two observed nodes X and Y in a GMP(C) do not share a common latent parent, then X
and Y do not share a common latent parent in any member of M(C). If X and Y are not
linked by an undirected edge in GMP(C), then X is not an ancestor of Y in any member of
M(C).

Let  FindPattern  be  the  algorithm  described  in  Table  1.  Then: 

Theorem 4 The output of FindPattern is a generalized measurement pattern GMP(C)
with respect to the tetrad and vanishing partial correlation constraints of $\Sigma$.

A  measurement  pattern  also  provides  lower  bounds  on  the  number  of  underlying  latent 
variables:  a  bound  can  be  obtained  from  the  size  of  any  clique  in  the  complement  of  graph 
C  as  defined  in  Table  1. 

Proposition 2 Let C' be the complement of graph C obtained at the end of Step 3 of
algorithm FindPattern, and let d be the size of any clique in C'. Then, there are at least d
latents in the unknown latent variable graph.

Proof: Follows directly from the fact that two neighbors in C' correspond to two observed
variables that do not share a common parent, by the soundness of CS1, CS2 and CS3. Since
no two elements have a common parent in the clique, there is at least one latent for each
element in the clique. □

Algorithm FindPattern
Input: a covariance matrix $\Sigma$

1. Start with a complete graph C over the observed variables.

2. Remove edges of pairs that are marginally uncorrelated or uncorrelated conditioned on
a third variable.

3. For every pair of nodes linked by an edge in C, apply successively rules CS1, and
CS2/CS3, if wanted. Remove an edge between every pair corresponding to a rule that
holds. Stop when it is not possible to apply any rule.

4. Let G be a graph with no edges and with nodes corresponding to observed variables.

5. For each maximal clique in C, add a new latent to G and make it a parent to all
corresponding nodes in the clique.

6. For each pair of nodes (A, B), if there is no other pair (C, D) such that
$\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$, add an undirected edge between A and B.

7. Return G.

Table 1: Returns the generalized measurement pattern of a latent variable graph.
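The clique-based part of FindPattern, together with the lower bound of Proposition 2, can be sketched as follows. This is a partial illustration under assumptions, not the full algorithm: the function names are hypothetical, the input `separated_pairs` is assumed to already contain the pairs for which some rule (CS1, CS2 or CS3) fired, and Step 2 (vanishing correlations) and Step 6 (undirected edges) are left out.

```python
def maximal_cliques(nodes, adj):
    """All maximal cliques of an undirected graph (Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return found


def find_pattern_skeleton(nodes, separated_pairs):
    """Steps 1, 3 (edge removal only), 4 and 5 of FindPattern.

    Returns (latent -> sorted children, lower bound on the number of latents).
    """
    nodes = sorted(nodes)
    adj = {v: set(nodes) - {v} for v in nodes}      # Step 1: complete graph C
    for a, b in separated_pairs:                     # Step 3: remove edges
        adj[a].discard(b)
        adj[b].discard(a)
    cliques = sorted(maximal_cliques(nodes, adj), key=sorted)
    # Steps 4-5: one new latent per maximal clique of C.
    children = {f"L{i}": sorted(c) for i, c in enumerate(cliques, 1)}
    # Proposition 2: any clique in the complement of C bounds the latents from below.
    comp = {v: set(nodes) - {v} - adj[v] for v in nodes}
    bound = max(len(c) for c in maximal_cliques(nodes, comp))
    return children, bound


xs, ys = ["X1", "X2", "X3"], ["Y1", "Y2", "Y3"]
# Assumed rule output: no X shares a parent with any Y; nothing is known about Z.
pairs = [(x, y) for x in xs for y in ys]
children, bound = find_pattern_skeleton(xs + ys + ["Z"], pairs)
print(children, bound)
```

In this assumed example Z ends up as a child of both latents, which is exactly the kind of node that a later purification step would remove.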

Notice  we  only  use  partial  correlations  with  up  to  1  variable  in  the  conditioning  set.  In 
principle,  the  algorithm  can  start  with  a  DAG  obtained  from  a  standard  structure  learning 
algorithm  (again,  this  is  how  the  heuristic  given  in  Heckerman  (1998)  works),  but  we  choose 
to  ignore  this  extra  information  to  avoid  extra  statistical  decisions.  Since  we  are  assuming 
that  observed  variables  are  heavily  connected  by  hidden  common  causes,  there  is  little  to 
be  gained  from  conditional  independence  constraints.  Also,  since  a  DAG  over  the  observed 
variables  should  be  very  dense  under  such  assumptions,  the  computational  cost  of  testing  all 
necessary  partial  correlations  might  be  prohibitive. 

Even  though  the  measurement  pattern  is  limited  in  information,  it  is  still  useful  for  data 
mining  purposes:  it  provides  an  indication  of  possible  underlying  latent  concepts.  However, 
a  more  informative  graph  can  be  obtained  if  we  are  willing  to  select  only  a  subset  of  the 
variables given as input. The next section discusses what purified patterns are, and which desirable
properties they have.


5  Purification  and  identifiability 

In Spirtes et al. (2000) and Silva et al. (2003) we discuss a special class of measurement
models called pure measurement models.

Definition 6 (Pure measurement model) Let G be a latent variable graph. A pure measurement
model for G is a measurement submodel of G in which each observed variable is
d-separated from every other variable conditional on one of its latent parents, that is, it is a
tree beneath the latents.

Therefore, in pure measurement models, observed variables should have one and only one
(latent) parent. Pure measurement models are shown to be useful in Silva (2002) as a
principled way of testing conditional independence among latents. Also, Silva et al. (2003)
designed an algorithm for learning measurement models from data that allows one to identify
every latent in the true unknown latent graph that generated the data, as well as at least
three of the indicators of each latent, as long as the measurement model is pure. This is done
by selecting a subset of the given observed variables. Also important, as observed by Silva et
al. (2003), learning a pure measurement model of the latents is a task much more robust to
sample variability than attempting to learn the less constrained measurement pattern. We
concluded that it is better to learn a submodel (i.e., using only a subset of the given variables)
that is more reliable than to try to learn a more complete model that is more prone to be
the result of several statistical mistakes.

However, as discussed in Section 2, we do not want to make the same assumptions³ as
in Silva et al. (2003). Because of that, we will lose the ability of identifying each latent in
the true unknown graph, and the latents appearing in the final output of our algorithm may
also correspond to more than one latent in the true graph. As important advantages, this
approach not only relies on fewer untestable assumptions, but also has desirable properties of
anytime computation, i.e., it gives results even when computation is interrupted before
the end. The anytime properties of our algorithm will be discussed in the next section.

Consider the following algorithm for creating a pure model from a GMP found by
FindPattern: make it pure by removing all nodes that have more than one latent parent or are
adjacent to another observed variable. This improves what we know about the measurement
model in the true graph G among the variables now remaining. For example, we know that
each remaining measured variable is d-separated from all other remaining measured variables
given its latent parents in the true graph G, which is crucial for discovering features of the
structural model in G (Spirtes et al., 2000, Chapter 12). Even a purified GMP is not, however,
necessarily complete with respect to features of the measurement model equivalence
class. Two observed variables that share a parent in the purified GMP might not share a single
latent parent in the true latent variable graph. Therefore, this GMP cannot parameterize
a measurement model where observed variables are linear functions of their parents.

We have not defined, however, how a GMP does or does not entail a constraint. Instead
of doing so directly, we introduce the concept of an l-interpretation (“latent interpretation”),
in order to parameterize the measurement model given in a purified GMP. The constraints
entailed by the l-interpretation are a subset of the constraints entailed by the measurement
model of the true latent variable graph G, a variant of I-maps (Pearl, 2000) for tetrad
constraints:


³ Silva et al. assume that the true model has a pure submodel with at least three indicators for each
latent, a much stronger assumption.

Algorithm BuildPureClusters
Input: a covariance matrix $\Sigma$

1. G ← FindPattern($\Sigma$).

2. Choose a set of latents in G. Remove all other latents and all observed nodes that are
not children of the remaining latents.

3. Remove all nodes that have more than one latent parent in G.

4. For all pairs of nodes linked by an undirected edge, remove one element of each pair.

5. If for some set of nodes {A, B, C}, all children of the same latent, there is a fourth
node D in G such that $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ is not true, remove one of these
four nodes.

6. If for some pair of nodes {A, B}, both children of the same latent, and another pair of
nodes {C, D} we have $\sigma_{AC}\sigma_{BD} \neq \sigma_{AD}\sigma_{BC}$, remove one of these four nodes.

7. Remove all latents with no children.

8. Return G.

Table 2: An algorithm for obtaining a pure l-interpretation.

Definition 7 Given a latent variable graph G(O) whose measurement model entails a set
of constraints C, an l-interpretation I(O') of G for $O' \subset O$ is a latent variable graph such
that the measurement model of I entails only constraints in C.

BuildPureClusters, an algorithm to create an l-interpretation for the unknown true
graph, is given in Table 2. The output is a pure generalized measurement pattern, or simply
a pure measurement model. It does not specify how choices in specific steps are made (e.g.,
which latents should be chosen in Step 2), and implementation details will be postponed to
Section 6.4. It is clear that a generalized measurement pattern becomes a pure measurement
model when we remove all nodes that have more than one parent or some observed neighbor.
And of course there are trivial l-interpretations, such as complete graphs. However,
not all l-interpretations are pure generalized measurement patterns. The following theorem
states that both properties hold for the output of BuildPureClusters:

Theorem 5 Let G(O) be a latent variable graph. Then the output of BuildPureClusters
is a valid l-interpretation for G in the family of tetrad and vanishing partial correlation
constraints and a pure generalized measurement pattern.
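The purely graphical steps of BuildPureClusters are simple enough to sketch in code. The example below is an assumption-laden illustration: it covers only Steps 3, 4 and 7 (the choice of latents in Step 2 and the tetrad tests of Steps 5 and 6 are omitted), the function name `purify` and the input pattern are hypothetical, and the endpoint dropped in Step 4 is chosen arbitrarily.

```python
def purify(children, undirected_edges):
    """Apply Steps 3, 4 and 7 of BuildPureClusters to a pattern.

    children maps each latent to a list of observed children;
    undirected_edges lists pairs of observed nodes linked by an undirected edge.
    """
    parent_count = {}
    for kids in children.values():
        for node in kids:
            parent_count[node] = parent_count.get(node, 0) + 1
    # Step 3: drop every node with more than one latent parent.
    removed = {node for node, count in parent_count.items() if count > 1}
    # Step 4: break each remaining undirected edge by dropping one endpoint
    # (here, arbitrarily, the second one).
    for a, b in undirected_edges:
        if a not in removed and b not in removed:
            removed.add(b)
    pure = {lat: [n for n in kids if n not in removed]
            for lat, kids in children.items()}
    # Step 7: drop latents left without children.
    return {lat: kids for lat, kids in pure.items() if kids}


# Assumed pattern: Z has two latent parents and X2 - X3 are linked by an undirected edge.
pattern = {"L1": ["X1", "X2", "X3", "Z"], "L2": ["Y1", "Y2", "Y3", "Z"]}
pure_pattern = purify(pattern, [("X2", "X3")])
print(pure_pattern)
```

Which endpoint to drop in Step 4 is a design choice that the table leaves open; a practical implementation would presumably pick the removal that keeps the most indicators per latent.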

One  can  also  show  that: 

Lemma 16 Let G(O) be a latent variable graph with latent covariance matrix $\Sigma_L$. For
any set $\{A, B, C, D\} = O' \subset O$, if $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ and for every set
$\{X, Y\} \subset O'$, $Z \in O$ we have $\rho_{XY.Z} \neq 0$ and $\rho_{XY} \neq 0$, then if A and B have a common latent
parent $L_1$ in G, B and C have a common latent parent $L_2$ in G, we have $L_1 = L_2$ with
probability 1 with respect to a Lebesgue measure over the coefficient and error variance parameters.

The l-interpretation output by BuildPureClusters can, in some circumstances, tell
us a lot about the true latent variable graph. Let O' be the set of observed nodes in the
pure measurement pattern P obtained by applying BuildPureClusters to the covariance
matrix generated by a true latent variable graph G. Let a cluster be a set of nodes that are
children of the same latent parent in P. We can infer the following graphical features of G
from P:

• Nodes in different clusters in P do not have a common parent in G.

• For all pairs $\{X, Y\} \subset O'$, X cannot be an ancestor of Y in G.

• Let Co be a cluster of P with at least 3 elements, and assume P has at least four
observed variables. Then if any subset of Co shares a common parent in G, this
subset has a unique common parent in G.

Thus, P forms a clustering that may be coarser than the one in G. That is, when a set of
variables has a single common cause in P, then G may partition the variables in the cluster
into groups having separate latent common causes. How far we can refine the clustering in P is a topic
for future research. Silva et al. (2003) describe a set of assumptions sufficient to obtain
a 1-to-1 correspondence between each latent in P and each latent in G; the assumptions
include the requirement that a sub-model of G has a pure measurement model with at least
3 indicators per latent.

As a final note, it is possible that some tetrad constraints exist in the population
but are not represented in the purified output. For instance, if there is a triplet of fully
connected latents $\{L_1, L_2, L_3\}$ such that $\rho_{L_2 L_3 . L_1} = 0$, then there will be one tetrad constraint
with one indicator of $L_2$, one of $L_3$ and two of $L_1$ that, by the definition of entailment in
measurement models, will not be entailed by the output graph (since the definition requires
that any entailed constraint should hold for any choice of latent covariance matrix). However,
this is of no importance as far as learning 1-interpretations goes.
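
The case above can be checked numerically. The sketch below assumes a pure linear measurement model in which X1 and X2 are indicators of the first latent, Y is an indicator of the second and Z of the third; all loadings and latent covariances are hypothetical values chosen only for illustration.

```python
def tetrad_differences(cov_l2_l3):
    """Return (extra, entailed) tetrad differences for indicators X1, X2, Y, Z.

    X1, X2 measure L1; Y measures L2; Z measures L3 (hypothetical values).
    """
    var_l1, cov_l1_l2, cov_l1_l3 = 2.0, 0.8, 0.5   # latent (co)variances
    a1, a2, b, c = 1.0, 0.7, 1.3, 0.9              # loadings of X1, X2, Y, Z

    # Covariances between distinct pure indicators depend only on loadings
    # and on the covariances of their latent parents.
    cov_x1_x2 = a1 * a2 * var_l1
    cov_y_z = b * c * cov_l2_l3
    cov_x1_y = a1 * b * cov_l1_l2
    cov_x2_z = a2 * c * cov_l1_l3
    cov_x1_z = a1 * c * cov_l1_l3
    cov_x2_y = a2 * b * cov_l1_l2

    # This tetrad vanishes only for special latent covariance matrices.
    extra = cov_x1_x2 * cov_y_z - cov_x1_y * cov_x2_z
    # This tetrad vanishes for every choice of latent covariance matrix.
    entailed = cov_x1_y * cov_x2_z - cov_x1_z * cov_x2_y
    return extra, entailed


# Value of cov(L2, L3) that makes the partial correlation given L1 vanish.
VANISHING_COV = 0.8 * 0.5 / 2.0

print(tetrad_differences(VANISHING_COV))  # both differences are (numerically) zero
print(tetrad_differences(0.4))            # only the entailed one stays zero
```

Only the second difference vanishes for every latent covariance matrix, which is why the first one is not counted as entailed.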

6  Statistical  learning  and  practical  implementations 

There are computational and statistical issues with the theoretical specification of
BuildPureClusters that have to be addressed in a practical implementation. The computational
cost of the procedure is apparently excessive, some steps are not fully specified
(such as Step 2 of BuildPureClusters), and one has to define how to deal with statistical
issues, since only a sample covariance matrix will be available.

In the next section, we will first describe the anytime properties of the general algorithms
described in Sections 4.4 and 5. We then briefly explain how to adapt our method to model
discrete distributions. This is followed by a discussion on statistical learning of graphical
models using constraint-satisfaction and model scoring and how it is related to the problem
of learning within the tetrad equivalence class. We conclude our discussion about practical
implementations by describing in full detail an algorithm that can be readily implemented
with heuristics that we believe to be useful in real-world applications.

6.1  Anytime  properties 

The  algorithm  in  Silva  et  al.  (2003)  had  the  property  of  being  able  to  identify  all  and  only 
the  latents  in  the  true  unknown  measurement  model,  given  the  assumptions  and  the  true 
covariance  matrix.  This  is  a  stronger  claim  than  the  one  given  in  Theorem  5,  which  concerns 
1-interpretations and generalized measurement patterns, and might not only collapse
different  latents  into  one,  but  also  throw  away  some  of  the  latents  found  in  the  true  graph. 

However,  in  order  to  learn  a  measurement  model  with  such  guarantees,  besides  the 
stronger  assumptions  the  algorithm  of  Silva  et  al.  (2003)  also  required  the  enumeration 
of  all  maximal  cliques  of  graph  C  (as  described  in  Table  1).  The  number  of  maximal  cliques 
can  be  quite  large,  especially  if  data  are  noisy  and  many  edges  of  C  are  erroneously  removed 
or  kept.  Moreover,  an  auxiliary  graph  has  to  be  built,  where  each  node  corresponds  to  a 
clique  in  C.  A  maximum  clique  has  to  be  found  in  this  new  graph,  which  is  a  well-known 
NP-hard problem without any efficient approximation algorithm. In contrast, the weaker
features of a 1-interpretation allow a formal description of how to interpret the output when
only partial information is provided.

There  is  a  stage  in  FindPattern  where  finding  all  maximal  cliques  of  a  graph  seems 
to  be  necessary.  In  fact,  it  is  not.  Identifying  more  cliques  will  only  increase  the  chance 
of  having  a  larger  output  by  the  end  of  the  algorithm  (which  is  good).  As  hinted  by  the 
freedom  of  choice  in  Step  2  of  BuildPureClusters,  stopping  Step  5  of  FindPattern 
after a given amount of time will not affect the result established by Theorem 5. Another
concern is the $O(N^6)$ loops in Step 3 of FindPattern, N being the number of variables.
Still, completing this set of loops is not a fundamental requirement if there are not enough
computational resources to do so. One can stop Step 3 at any time at the price of losing
information, but not the theoretical guarantees of Theorem 5. This is summarized by the
following corollary:

Corollary 1 Let $G(O)$ be a latent variable graph. Then the output of BuildPureClusters
is a 1-interpretation for G in the family of tetrad and vanishing partial correlation
constraints even when rules CS1, CS2 and CS3 are applied an arbitrary number of times in
FindPattern for any arbitrary subset of nodes and an arbitrary number of maximal cliques
is found.

In  other  words,  one  can  stop  the  loop  at  Step  3  of  FindPattern  at  any  moment,  as  well 
as  the  one  at  Step  5,  and  still  get  a  theoretical  guarantee  of  consistency.  There  is  a  clear 
trade-off in this procedure: the longer one keeps such loops running, the more nodes the
final purified pattern is likely to contain, and the more informative it will be, since
nodes of different latents that can in principle be separated will remain together if the
relevant test is never applied.
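
The anytime behaviour can be illustrated with a small sketch. It is not the FindPattern procedure itself: the function `can_separate` is a hypothetical stand-in for a successful application of rules CS1 to CS3, and the budget is counted in number of checks (rather than wall-clock time) only to keep the example deterministic.

```python
from itertools import combinations


def separate_with_budget(nodes, can_separate, max_checks):
    """Remove edges from a complete graph until the check budget is spent.

    can_separate(a, b) is a hypothetical stand-in for a rule that shows
    that a and b do not share a latent parent.
    """
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}
    checks = 0
    for a, b in combinations(nodes, 2):
        if checks >= max_checks:
            break  # stopping early: untested edges are simply kept
        checks += 1
        if can_separate(a, b):
            edges.discard(frozenset((a, b)))
    return edges


def toy_can_separate(a, b):
    # Toy ground truth: nodes 1, 2 form one cluster and 3, 4 another.
    return (a <= 2) != (b <= 2)


full = separate_with_budget([1, 2, 3, 4], toy_can_separate, max_checks=6)
partial = separate_with_budget([1, 2, 3, 4], toy_can_separate, max_checks=2)
print(sorted(sorted(e) for e in full))     # edges left after all checks
print(sorted(sorted(e) for e in partial))  # a superset: fewer separations
```

With a smaller budget the remaining graph is a supergraph of the fully processed one: no wrong separation is introduced, but some separations are missed, which mirrors the loss of information described above.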

6.2 Discrete models

Although we do not perform any experiments with discrete models in this report, it is worth
mentioning that there are relatively straightforward ways of adapting the algorithms
discussed here to discrete distributions. We can build on the same ideas used in discrete factor
analysis. The closest framework is the underlying variable approach, where observed variables
are assumed to be discretizations of some unobserved underlying continuous variables.
The underlying variables are then indicators of another set of latents, in the same way our
observed continuous variables are associated by a layer of hidden common causes. Tetrad
constraints will hold for some sets of underlying variables, so that essentially the same
algorithm is carried out on another level of unobserved variables.

In order to test tetrad constraints among underlying variables, one needs to assume a
probabilistic model for latent variables, where the probability mass of an underlying variable
in a given range will correspond to the observed probability of a discrete variable assuming
some value. To test a set of tetrad constraints, one will need to fit a particular submodel
that entails those tetrads. This is computationally expensive, since it will require numerical
integration over the respective ranges that each underlying variable spans for each combination
of values of the observed discrete variables. Bartholomew and Knott (1999) describe
discrete factor analysis in detail.
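
The correspondence between probability mass and observed frequencies can be sketched for a single variable. The example below assumes a standard normal underlying variable (an assumption made here for illustration) and covers only the computation of cut points, not the fitting of the full submodel; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed distribution of the underlying variable


def thresholds_from_proportions(proportions):
    """Cut points whose intervals carry the observed category frequencies."""
    cuts, cumulative = [], 0.0
    for p in proportions[:-1]:
        cumulative += p
        cuts.append(STD_NORMAL.inv_cdf(cumulative))
    return cuts


def category_mass(cuts, k):
    """Probability mass of the underlying variable in the k-th interval."""
    lower = STD_NORMAL.cdf(cuts[k - 1]) if k > 0 else 0.0
    upper = STD_NORMAL.cdf(cuts[k]) if k < len(cuts) else 1.0
    return upper - lower


observed = [0.2, 0.5, 0.3]  # hypothetical frequencies of a 3-valued variable
cuts = thresholds_from_proportions(observed)
print(cuts)
print([category_mass(cuts, k) for k in range(3)])  # recovers `observed`
```

Testing tetrad constraints among several underlying variables would additionally require integrating a joint density over products of such intervals, which is the expensive part mentioned above.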

As  an  alternative,  one  could  assume  that  latent  and  observed  variables  are  binary.  In 
this  situation,  tetrad  constraints  will  still  hold  (Pearl,  1988).  However,  in  our  preliminary 
experiments  with  simulated  models,  statistical  tests  of  binary  tetrad  constraints  failed  to  be 
reliable. 

6.3  Statistical  learning 

Silva et al. (2003) argue that estimating measurement patterns from data can be a very
difficult task: in simulations, the empirical patterns had considerably
more induced latents than the synthetic models from which we sampled. The purified
measurement models obtained from such patterns, however, were quite close to the true ones, even in cases
where the statistical model was wrong (i.e., assuming Gaussian distributions where data
were not Gaussian). Since in this work we allow even less constrained measurement
models, we will still focus on the estimation of pure measurement models only. Patterns will
be estimated as an intermediate step, but our goal in the algorithms described here is to
reliably learn pure measurement models from data.

Given a sample covariance matrix, one cannot expect that any tetrad constraints will
hold exactly, but they will hold approximately. In order to test the statistical significance
of such constraints, Spirtes et al. (2000) use a normal approximation for each sample tetrad
difference $r_{IJ} r_{KL} - r_{IL} r_{JK}$, where $r_{XY}$ is the sample correlation coefficient of X and Y. Means
and variances for such statistics are described in Wishart (1928). Bollen (1990) describes
an asymptotically distribution-free test of vanishing tetrads. The computational cost of the
latter test may slow down the procedure considerably, since Bollen's procedure requires the
fourth moments of the data set. Concerning vanishing partial correlations, Spirtes et al.
(2000) also discuss possible tests.
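
As an illustration of a test of a vanishing first-order partial correlation, the sketch below uses Fisher's z-transformation, a standard choice under a Gaussian assumption. The source does not state which test is used, so this choice is an assumption; the tetrad tests of Wishart and Bollen are not reproduced here.

```python
import math
from statistics import NormalDist


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y given Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def fisher_z_p_value(r, sample_size, n_conditioning=0):
    """Two-sided p-value for the null hypothesis that the correlation is zero."""
    z = math.atanh(r) * math.sqrt(sample_size - n_conditioning - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical sample correlations: X and Y are correlated only through Z.
r = partial_correlation(r_xy=0.25, r_xz=0.5, r_yz=0.5)
print(r, fisher_z_p_value(r, sample_size=500, n_conditioning=1))

# A clearly non-zero marginal correlation yields a very small p-value.
print(fisher_z_p_value(0.5, sample_size=103))
```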

Therefore, in FindPattern and BuildPureClusters one could plug in those tests
in order to verify which constraints are significant. This would be a typical
“constraint-satisfaction” approach for causal discovery in graphical models, in contrast to
“score-based” approaches that define a score function for a model and a set of operators that
generate new candidates from the current one. There is a clear advantage in using the
score-based method, in the sense that each model is scored as a whole and, therefore, the search uses a more
robust account of the quality of a candidate graph. In contrast, a constraint-satisfaction
method scores parts of a model independently.

However,  it  is  often  much  easier  in  latent  variable  models  to  define  a  consistent  search 
space  for  constraint-satisfaction  approaches,  since  it  is  possible  to  control  which  particular 
constraints  are  going  to  be  used.  While  a  typical  score  function  used  in  score-based  search  is 
a  function  of  the  posterior  distribution  of  the  graph  given  the  data,  for  general  latent  variable 
models  it  is  not  obvious  how  to  characterize  score-equivalent  models,  a  necessary  first  step 
to  even  start  considering  the  design  of  consistent  algorithms.  Even  if  such  equivalence  is 
proven, there is still the major problem of designing a computationally practical algorithm for
consistent estimation of the true graph. Zhang (2004) does describe score-equivalent groups
of latent variable models, but does not give a proof of consistency for his hill-climbing search
procedure.

In  our  preliminary  experiments  in  learning  pure  measurement  models,  it  is  often  the 
case  that  finding  out  which  indicators  should  not  be  clustered  together  is  a  quite  robust 
step  (under  the  implementation  we  describe  in  the  next  section).  However,  purification 
is  a  more  sensitive  step:  at  least  for  a  fixed  p-value  and  using  false  discovery  rates  to 
control  for  multiplicity  of  tests,  purification  by  constraint-satisfaction  often  throws  away 
many more indicators than necessary when the number of variables is relatively small, and
does  not  eliminate  many  impurities  when  the  number  of  variables  is  too  large. 
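
For reference, one common way of controlling the false discovery rate over a batch of tests is the Benjamini-Hochberg step-up procedure. The text does not say which procedure was used, so the sketch below is an assumption made for illustration, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected at false discovery rate q (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold.
        if p_values[index] <= rank * q / m:
            largest_rank = rank
    return sorted(order[:largest_rank])


# Hypothetical p-values from eight tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_values, q=0.05))
```

Because each threshold scales with the number of tests, the same p-value can be rejected in a small batch and retained in a large one, which is one way the behaviour of the purification step can depend on the number of variables.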

Instead,  we  will  adopt  a  hybrid  constraint-satisfaction/score-based  approach.  The  first 
stage  consists  of  an  algorithm  to  cluster  variables  based  on  a  modification  of  FindPattern. 
An implementation of a modified purification (Steps 5 and 6 of BuildPureClusters) is
also described, which will be based on a greedy hill-climbing score-based search that first
heuristically identifies extra paths among indicators that are not mediated by latents.
Details  of  such  algorithms  will  be  covered  in  the  next  section.  In  the  rest  of  this  subsection, 
we  discuss  how  to  score  a  measurement  model  and  fit  its  parameters  to  a  given  data  set. 

For our algorithms we use the Bayesian Information Criterion (BIC) as a score function
under a multivariate Gaussian distribution. Although one can claim that such a representation
requires strong assumptions about the joint distribution of the data, it is still widely used
as the parametric family of choice for measurement models (Bollen, 1989). Such assumptions
might not be too harmful considering that only the second moments of the distribution
are important for our algorithms. Section 7 shows a few simulation results when the true
distribution is far from normal. Also, the essence of the main algorithm as discussed in Section
6.4 is not affected by the choice of probability model, although the estimation procedures
discussed next will need to be modified if one wants to adopt a different model.

Another  concern  could  be  the  choice  of  BIC  as  score  function:  it  is  known  that  BIC  is  not 
a  consistent  approximation  of  the  posterior  of  a  latent  variable  model  (Rusakov  and  Geiger, 
2004).  BIC  is  used  in  our  framework  for  its  many  computational  advantages,  especially 
when used with Structural EM (Friedman, 1998). More importantly, we will show through
simulations in Section 7 that BIC is still a useful approximation to use in our problem.

6.3.1  Parameterization  and  scoring 

For our Gaussian probability models, first we will assume that variables are centered on their
means, and therefore we only have to define the implied covariance matrix of the distribution
as a function of the model parameters. Let $\eta$ be the vector representing the latent variables
in the model and $y$ the vector of observed variables. All relationships will be linear under
this distribution, with additive error terms. Let $\epsilon$ represent the error terms associated with
observed variables, and $\zeta$ the error terms of latent variables. We parameterize the direct
effect of parents on the respective children as follows:

$$y = \Lambda_y y + \Lambda_\eta \eta + \epsilon$$
$$\eta = B \eta + \zeta$$

Matrices $\Lambda_y$ and $\Lambda_\eta$ can be very sparse: for instance, there will be a non-zero entry
$(\Lambda_y)_{ij}$ only if $y_j$ is a parent of $y_i$ in our model. On the other hand, matrix $B$ will be a lower
triangular matrix with zeroes along the diagonal and above it, and no other zero entries. This
is equivalent to a fully connected subgraph of latent variables, representing the irrelevance
of the actual latent structure for our task. The covariance matrix of $\zeta$ is diagonal. Notice this is just one way
of encoding an arbitrary positive semidefinite covariance matrix for the latents.

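Let $\Theta = \{\Lambda_y, \Lambda_\eta, B, \Phi, \Psi\}$ be the parameter set of our model, where $\Phi = E[\epsilon\epsilon^T]$ is the
covariance matrix of the error terms of observed variables, and $\Psi = E[\zeta\zeta^T]$. We will denote
by $\Sigma_{\eta\eta}(\Theta)$ the implied covariance matrix of $\eta$, which can be shown to be as follows:

$$\Sigma_{\eta\eta}(\Theta) = (I - B)^{-1} \Psi (I - B)^{-T}$$

where $I$ is the identity matrix.

Analogously, the implied covariance matrix of $y$ will be given by

$$\Sigma_{yy}(\Theta) = (I - \Lambda_y)^{-1} [\Lambda_\eta \Sigma_{\eta\eta}(\Theta) \Lambda_\eta^T + \Phi] (I - \Lambda_y)^{-T}$$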
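
The two implied covariance matrices can be computed directly from the parameter matrices. The sketch below assumes NumPy is available; the function name and the toy parameter values (one latent with two indicators) are hypothetical.

```python
import numpy as np


def implied_covariances(lambda_y, lambda_eta, b, phi, psi):
    """Implied covariance of the latents and of the observed variables.

    lambda_y:   (p, p) direct effects among observed variables
    lambda_eta: (p, k) loadings of observed variables on latents
    b:          (k, k) strictly lower triangular latent coefficients
    phi:        (p, p) covariance of observed error terms
    psi:        (k, k) diagonal covariance of latent error terms
    """
    n_obs, n_lat = lambda_eta.shape
    inv_lat = np.linalg.inv(np.eye(n_lat) - b)
    sigma_eta = inv_lat @ psi @ inv_lat.T
    inv_obs = np.linalg.inv(np.eye(n_obs) - lambda_y)
    sigma_y = inv_obs @ (lambda_eta @ sigma_eta @ lambda_eta.T + phi) @ inv_obs.T
    return sigma_eta, sigma_y


# Toy model: one latent with unit variance and two pure indicators.
sigma_eta, sigma_y = implied_covariances(
    lambda_y=np.zeros((2, 2)),
    lambda_eta=np.array([[1.0], [2.0]]),
    b=np.zeros((1, 1)),
    phi=np.diag([0.5, 0.5]),
    psi=np.array([[1.0]]),
)
print(sigma_eta)
print(sigma_y)
```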

Let $\hat{\Theta}$ be the maximum likelihood estimator of $\Theta$. Let $d$ be the number of parameters in
$\Theta$, let $S$ be the sample covariance matrix of the observed variables and $N$ the sample size.
Then the BIC score of a measurement model, up to additive constants, will be given by

$$BIC = -\log|\Sigma_{yy}(\hat{\Theta})| - tr(S \Sigma_{yy}^{-1}(\hat{\Theta})) - \frac{d}{N}\log(N) \qquad (1)$$

where $tr(M)$ denotes the trace of matrix $M$ and $|M|$ its determinant.
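
Equation (1) can be evaluated as follows once the implied covariance matrix at the maximum likelihood estimate is available. This is a sketch assuming NumPy; the penalty term is read here as $(d/N)\log(N)$, which is the form consistent with a Gaussian log-likelihood rescaled by $2/N$, and the function name is hypothetical.

```python
import math

import numpy as np


def bic_score(sample_cov, implied_cov, n_params, sample_size):
    """BIC of a Gaussian model, up to constants, as in Equation (1)."""
    sign, logdet = np.linalg.slogdet(implied_cov)
    if sign <= 0:
        raise ValueError("implied covariance matrix must be positive definite")
    fit = np.trace(sample_cov @ np.linalg.inv(implied_cov))
    penalty = (n_params / sample_size) * math.log(sample_size)
    return -logdet - fit - penalty


# Toy check: a perfectly fitting identity covariance with 3 free parameters.
score = bic_score(np.eye(2), np.eye(2), n_params=3, sample_size=100)
print(score)
```

Higher scores are better under this sign convention, so adding parameters is only worthwhile when the gain in fit exceeds the penalty.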

6.3.2  Estimation 

In order to score a model, one has to find the maximum likelihood estimator of the
parameters. There are a variety of methods for accomplishing this, including gradient-based
methods and expectation-maximization variations. However, when choosing a method one
has to consider that it will be used inside a computationally expensive search for a
well-fitting model. Since Structural EM (Friedman, 1998), which we adopt in the algorithm described
in Section 6.4, is a natural choice for efficient hill-climbing search among latent variable models,
an EM estimator will be used.

We  generalize  the  results  of  Rubin  and  Thayer  (1982)  in  order  to  allow  direct  effects  of 
observed  variables  on  other  observed  variables.  Also,  we  will  allow  correlated  error  terms 
of observed variables, i.e., $\Phi$ will be allowed to be an arbitrary symmetric positive definite
matrix. 

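Since given $\Theta$ the distribution is jointly normal, from standard results in linear regression,
the conditional distribution of $\eta$ given $y$ can be obtained from $\Sigma_{yy}(\Theta)$, $\Sigma_{y\eta}(\Theta)$ and $\Sigma_{\eta\eta}(\Theta)$,
where it can be shown that

$$\Sigma_{y\eta}(\Theta) = (I - \Lambda_y)^{-1} \Lambda_\eta \Sigma_{\eta\eta}(\Theta)$$

and the conditional distribution of $\eta$ given $y$ is a multivariate normal with mean $\delta^T y$ and
covariance $\Delta$ given by:

$$\delta = \Sigma_{yy}^{-1}(\Theta) \Sigma_{y\eta}(\Theta)$$

$$\Delta = \Sigma_{\eta\eta}(\Theta) - \Sigma_{y\eta}^T(\Theta) \delta$$

Therefore, the expectation step of the algorithm is reduced to

$$E[S_{yy} | S, \Theta] = S$$

$$E[S_{y\eta} | S, \Theta] = S\delta$$

$$E[S_{\eta\eta} | S, \Theta] = \delta^T S \delta + \Delta$$

where $S$ is the sample covariance matrix.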
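
The expectation step can be written in a few lines. This is a sketch assuming NumPy, with the implied covariance blocks passed in as arguments; the function and argument names are hypothetical.

```python
import numpy as np


def e_step(sample_cov, sigma_y, sigma_y_eta, sigma_eta):
    """Expected second moments of (y, eta) given the sample covariance.

    sigma_y, sigma_y_eta and sigma_eta are the implied covariance blocks
    under the current parameters.
    """
    delta = np.linalg.solve(sigma_y, sigma_y_eta)   # regression of eta on y
    cond_cov = sigma_eta - sigma_y_eta.T @ delta    # conditional covariance
    exp_s_y_eta = sample_cov @ delta
    exp_s_eta = delta.T @ sample_cov @ delta + cond_cov
    return sample_cov, exp_s_y_eta, exp_s_eta


# Toy one-latent model: implied blocks for loadings (1, 2), error variance 0.5.
sigma_y = np.array([[1.5, 2.0], [2.0, 4.5]])
sigma_y_eta = np.array([[1.0], [2.0]])
sigma_eta = np.array([[1.0]])

# When the sample covariance equals the implied one, the expected moments
# coincide with the implied moments.
_, exp_s_y_eta, exp_s_eta = e_step(sigma_y, sigma_y, sigma_y_eta, sigma_eta)
print(exp_s_y_eta.ravel(), exp_s_eta)
```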

Once a full correlation matrix of observed and latent variables is obtained, we need to
estimate the parameters of the model. Non-zero off-diagonal entries in the error covariance
matrix are represented by bidirected edges in the graph to indicate extra hidden common
causes of a pair of variables that are independent of the other latents. We apply the
algorithm of Drton and Richardson (2003) using the joint expected covariance matrix of latents
and observed variables. We do not use straightforward maximum likelihood estimation, e.g.,
gradient-based methods or closed-form regressions, because of the bidirected edges:
unconstrained maximization might result in non-positive definite implied covariance matrices,
since no constraints are enforced in the parameterization of bidirected edges. Drton and
Richardson's algorithm explicitly takes into account bidirected edges, and it is guaranteed
to converge to a local maximum.

6.4  Actual  implementation 

The main problem of applying FindPattern directly by using statistical tests of tetrad
constraints is the number of false positives: accepting a rule (CS1, CS2, or CS3) as true
when it does not hold in the population. One can see that this might happen relatively often
when there are large groups of observed variables that are pure indicators of some latent:
for instance, assume there is a latent $L_0$ with 10 pure indicators. Consider applying CS1
to a group of six pure indicators of $L_0$. The first two constraints hold in the population,
and so assume they are correctly identified by the statistical test. The last constraint,
$\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$, should not hold in the population, but will not be rejected by
the test with some probability. Since there are $10!/(6!\,4!) = 210$ ways of CS1 being wrongly
applied due to a statistical mistake, we might get many false positives. The problem gets
worse if there is a pure submodel of the true graph with many latents and many indicators
per latent, since the same situation can happen using indicators of not only one latent but
multiple ones, and this can be observed in simulations.
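
The count of 210 groups, and how quickly it grows with the number of pure indicators, can be reproduced with a short computation. The function name is hypothetical, and the larger cluster sizes are shown only for comparison.

```python
from math import comb


def groups_of_six(n_indicators):
    """Number of ways to pick six indicators of the same latent."""
    return comb(n_indicators, 6)


for n in (10, 15, 20):
    print(n, groups_of_six(n))
```

Each such group is one more opportunity for a single erroneous test outcome to trigger CS1.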

We propose here a modification to increase the robustness of FindPattern, described
in detail in Table 3 as the RobustBuildPureClusters algorithm: add a first step,
FindInitialSelection (Table 4), where we decide that two variables $X_1$ and $Y_1$ do not have
common parents only when there are sets $\mathbf{X}$ and $\mathbf{Y}$, $X_1 \in \mathbf{X}$, $Y_1 \in \mathbf{Y}$, where the same holds
for every pair in $\mathbf{X} \times \mathbf{Y}$. In this case, we use sets of size three, since we need at least
six variables in order to make a local decision of nodes not sharing a common parent. Since the
number of constraints tested in this situation is much higher, there is a considerably smaller
chance of an acceptance happening by statistical coincidence.

Once we generate maximal cliques from C (as defined in Table 4) in FindInitialSelection
using this restricted condition, we generate an intermediate graph H where each clique
from C is represented by a node in H. For each pair $\{M_i, M_j\}$ of nodes in H, we check again
if there is a group $\mathbf{X}$ of nodes in the clique represented by $M_i$ and a group $\mathbf{Y}$ of nodes in the
clique represented by $M_j$ such that every pair in $\mathbf{X} \times \mathbf{Y}$ satisfies the condition of disjoint
parents. An edge between a pair $M_i$ and $M_j$ will be added only if such a condition is satisfied.
Finally, a maximal clique of nodes in H is selected and purified. The final purified model is
used as a seed for the next step.

The actual test DisjointGroup($X_1, X_2, X_3, Y_1, Y_2, Y_3; \Sigma$) used in FindInitialSelection
is an application of CS1 for all pairs in $\{X_1, X_2, X_3\} \times \{Y_1, Y_2, Y_3\}$ using only nodes
from this set. Also, we add an extra constraint: for every pair $\{X_i, X_j\} \subset \{X_1, X_2, X_3\}$
and every pair $\{Y_p, Y_q\} \subset \{Y_1, Y_2, Y_3\}$ we also require that $\sigma_{X_iY_p}\sigma_{X_jY_q} = \sigma_{X_iY_q}\sigma_{X_jY_p}$. The
motivation is that we are looking for two sets of three indicators each from two different
latent variables, where these constraints will hold. The extra redundancy will then help to
reduce the number of false positives. Notice also that in DisjointGroup we do not test for
vanishing correlations: this is verified as part of the graphical structure of $C_0$. In the
experiments in the next section, we actually do not make use of the vanishing partial correlations
of first order, reducing the set of statistical decisions. We are implicitly assuming that no
observable conditional d-separations exist in the true model.
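
The extra constraint of DisjointGroup can be sketched as a loop over pairs. The example below covers only this extra cross-group constraint, not rule CS1, and it replaces the statistical test by a numerical tolerance on a population covariance matrix; the function name, the tolerance and the toy loadings are assumptions made for illustration.

```python
from itertools import combinations


def cross_tetrads_hold(cov, xs, ys, tol=1e-9):
    """Check cov[xi][yp]*cov[xj][yq] == cov[xi][yq]*cov[xj][yp] for all pairs."""
    for xi, xj in combinations(xs, 2):
        for yp, yq in combinations(ys, 2):
            diff = cov[xi][yp] * cov[xj][yq] - cov[xi][yq] * cov[xj][yp]
            if abs(diff) > tol:
                return False
    return True


# Toy population covariance: indices 0-2 measure one latent, 3-5 another.
loadings = [1.0, 0.8, 0.6, 0.9, 0.7, 0.5]
latent_of = [0, 0, 0, 1, 1, 1]
latent_cov = [[1.0, 0.4], [0.4, 1.0]]
cov = [[loadings[i] * loadings[j] * latent_cov[latent_of[i]][latent_of[j]]
        + (0.3 if i == j else 0.0)          # independent error variances
        for j in range(6)] for i in range(6)]

print(cross_tetrads_hold(cov, xs=[0, 1, 2], ys=[3, 4, 5]))

# Perturbing one cross-covariance breaks the constraints.
perturbed = [row[:] for row in cov]
perturbed[0][3] += 0.1
perturbed[3][0] += 0.1
print(cross_tetrads_hold(perturbed, xs=[0, 1, 2], ys=[3, 4, 5]))
```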

Looking  for  triplets  of  indicators  of  two  distinct  latents  is  also  a  motivation  for  defining 
yellow  edges  in  FindInitialSelection.  If  two  nodes  cannot  be  separated  but  also  cannot 
be  in  the  same  cluster  in  a  purified  1-interpretation  with  two  latents  and  three  indicators 
each  (which  would  entail  the  constraints  in  DisjointGroup),  then  it  is  of  no  use  to  add 
both  to  our  initial  selection. 

FindMaximalCliques can be any algorithm for finding maximal cliques. We used the
one described in Bron and Kerbosch (1973). A different matter is ChooseClusteringClique,
which we describe as follows: since the number of cliques (maximal or not) in
H can be large, we will be interested only in the clustering that satisfies a given optimality
condition (where a clustering is a set of clusters, i.e., a set of sets of indicators which in the
end will correspond to a pure model). Such a condition should be associated with the number
of indicators that remain in the model after purification. We will search for a good clustering
greedily without enumerating all cliques. First, we define the size of a clustering $H_{candidate}$
(a set of nodes from H, which means a set of sets of nodes in O, where each node in H
is a cluster) as the number of indicators that remain according to the following elimination
criteria:

1. eliminate all indicators that appear in more than one cluster inside $H_{candidate}$;

2. for each pair of indicators $\{I_1, I_2\}$ such that $I_1$ and $I_2$ belong to different clusters in $H_{candidate}$,
if there is an edge $I_1 - I_2$ in C, then we remove one element of $\{I_1, I_2\}$ from $H_{candidate}$ (i.e.,
we guarantee that no pair of indicators from different clusters that were not shown to lack
a common latent parent will exist in $H_{candidate}$). We eliminate the one that belongs to
the larger cluster, unless the smaller cluster has fewer than three elements, to avoid extra
fragmentation;

3. eliminate clusters that have only one indicator.
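
The size of a candidate clustering can be computed as in the sketch below. Several details are not fixed by the criteria above and are assumptions of this sketch: pairs are visited in sorted order, ties in cluster size treat the first cluster as the larger one, and when the smaller cluster has fewer than three elements the indicator is removed from that smaller cluster.

```python
def clustering_size(clusters, edges_in_c):
    """Count indicators left after the three elimination criteria.

    clusters:   iterable of sets of indicators (one set per node of H).
    edges_in_c: set of frozenset({a, b}) pairs still adjacent in graph C.
    """
    clusters = [set(c) for c in clusters]

    # Criterion 1: drop indicators appearing in more than one cluster.
    counts = {}
    for cluster in clusters:
        for indicator in cluster:
            counts[indicator] = counts.get(indicator, 0) + 1
    clusters = [{i for i in c if counts[i] == 1} for c in clusters]

    # Criterion 2: for cross-cluster pairs still adjacent in C, drop one member.
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            for i1 in sorted(clusters[a]):
                for i2 in sorted(clusters[b]):
                    if i1 not in clusters[a] or i2 not in clusters[b]:
                        continue  # already eliminated by an earlier pair
                    if frozenset((i1, i2)) not in edges_in_c:
                        continue
                    # Assumption: ties count cluster a as the larger one.
                    if len(clusters[a]) >= len(clusters[b]):
                        larger, smaller = a, b
                    else:
                        larger, smaller = b, a
                    # Assumption: a small cluster (< 3) gives up its own member.
                    target = smaller if len(clusters[smaller]) < 3 else larger
                    clusters[target].discard(i1 if target == a else i2)

    # Criterion 3: clusters with a single indicator do not count.
    return sum(len(c) for c in clusters if len(c) > 1)


example = [{1, 2, 3, 4}, {4, 5, 6, 7}, {8}]
print(clustering_size(example, {frozenset((3, 5))}))
```

In the example, indicator 4 is dropped by the first criterion, indicator 3 by the second, and the singleton cluster {8} by the third, leaving five indicators.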

The  optimality  condition  will  be  finding  a  clustering  of  largest  size.  The  assumption  is 
that  a  model  with  a  large  size  will  have  a  large  number  of  indicators  after  purification.  Our 
suggested heuristic to be implemented as ChooseClusteringClique is to search for a
good model using a very simple hill-climbing algorithm that starts from an arbitrary node
in H and adds new clusters to the current candidate, each time choosing the one that increases
its size the most while still forming a maximal clique in H. We stop when we cannot increase
the size of the candidate. This is calculated using each node in H as a starting point, and
the largest candidate is returned by ChooseClusteringClique.
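
A minimal sketch of this greedy search is given below. It reads the clique requirement as "every added node must be adjacent in H to all nodes already in the candidate", takes the size function as a parameter, and uses a toy size (the number of distinct indicators covered) in the example; these choices and all names are assumptions made for illustration.

```python
def choose_clustering_clique(adjacency, size_of):
    """Greedy hill-climbing over cliques of H, restarted from every node.

    adjacency: dict mapping each node of H to the set of its neighbours.
    size_of:   function giving the size of a candidate (a set of nodes).
    """
    best, best_size = set(), -1
    for start in adjacency:
        candidate = {start}
        current = size_of(candidate)
        while True:
            improving = []
            for node in adjacency:
                # Only nodes adjacent to every member keep the candidate a clique.
                if node in candidate or not candidate <= adjacency[node]:
                    continue
                new_size = size_of(candidate | {node})
                if new_size > current:
                    improving.append((new_size, node))
            if not improving:
                break  # no extension increases the size
            current, chosen = max(improving, key=lambda pair: pair[0])
            candidate.add(chosen)
        if current > best_size:
            best, best_size = candidate, current
    return best


# Toy graph H: each node stands for a cluster of indicators.
toy_clusters = {"A": {1, 2, 3}, "B": {4, 5, 6}, "C": {7, 8}, "D": {9, 10}}
toy_adjacency = {"A": {"B", "C", "D"}, "B": {"A", "C"},
                 "C": {"A", "B"}, "D": {"A"}}


def toy_size(nodes):
    return len(set().union(*(toy_clusters[n] for n in nodes)))


print(sorted(choose_clustering_clique(toy_adjacency, toy_size)))
```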

The next steps in RobustBuildPureClusters are basically the FindPattern of
Table 1 with a final purification. The main difference is that we no longer check whether
pairs of nodes in the initial clustering given by Selection should be separated. The intuition
explaining the usefulness of this implementation is as follows: if there is a group of latents
forming a pure subgraph of the true graph with a large number of pure indicators for each
latent, then the initial step should identify such a group. The subsequent steps will refine this
solution without the risk of splitting the large clusters of variables, which are exactly the ones
most likely to produce false positive decisions with constraint sets {CS1, CS2, CS3}.
RobustBuildPureClusters has the power of identifying the latents with large sets of pure
indicators and refining this solution with more flexible rules, therefore generating the smaller
clusters. The function ChooseClustering is identical to ChooseClusteringClique,
but now we do not worry about which pairs of nodes from our new H are linked.

Notice that FindInitialSelection is very similar to the FindMeasurementPattern
algorithm of Silva et al. (2003). An essential difference is that we are not concerned
about finding a pure model with three indicators per latent: for instance, it might be the
case that one of the latents chosen before purification will be discarded if all of its measures
are removed by the RobustPurify algorithm.

To give an idea of how the later steps of refinement are essential for the success of
RobustBuildPureClusters, we ran some simulations with models that, according to the
experiments analyzed in Silva et al. (2003), were the most challenging for
FindMeasurementPattern: models where the largest pure subgraph of the true graph has exactly three
pure indicators per latent. We generated 20 different data sets with 1,000 instances, each
one sampled from a different random parameterization⁴ of a pure measurement model with
a fully connected latent structure, 5 latents and 3 indicators per latent. We got an average
number of 1.89 latents missing with FindMeasurementPattern (standard deviation
of 0.87), where “missing latents” are counted as follows: there are none of its indicators
(known from the simulated graph) in the outcome; or there is one indicator, but it is
clustered with indicators of other latents. In contrast, we got an average of 0.4 missing latents
with RobustBuildPureClusters (standard deviation of 0.6).

⁴ In the next section, we explain how we generate parameters for our simulated models.

Algorithm RobustBuildPureClusters

Input: $\Sigma$, a sample covariance matrix of a set of variables O

1. (Selection, C, $C_0$) ← FindInitialSelection($\Sigma$).

2. For every pair of nonadjacent nodes $\{N_1, N_2\}$ in C where at least one of them is not
in Selection and an edge $N_1 - N_2$ exists in $C_0$, add a RED edge $N_1 - N_2$ to C.

3. For every pair of nodes linked by a RED edge in C, apply successively rules CS1, CS2
(and CS3, if wanted). Remove an edge between every pair corresponding to a rule that
holds. Stop when it is not possible to apply any rule or when we run out of time.

4. Let H be a graph where each node corresponds to a maximal clique in C. Make H a
complete graph.

5. FinalClustering ← ChooseClustering(H).

6. Return RobustPurify(FinalClustering, C, $\Sigma$).

Table 3: A modified BuildPureClusters algorithm that starts from an initial pure model
and ends with another purification. See the text for the definition of ChooseClustering
and the next tables for the definition of the other functions.

FindMeasurementPattern  got  an  average  number  of  0.37  indicators  misplaced  in  a 
wrong  cluster,  where  “misplaced  indicators”  are  counted  as  follows:  for  a  given  cluster  in 
the  outcome  of  the  algorithm,  the  misplaced  indicator  is  the  only  one  from  a  different  true 
cluster5.  RobustBuildPureClusters  got  an  average  of  0.1.  Finally,  FindMeasure¬ 
mentPattern  got  an  average  number  of  6  missing  indicators  with  respect  to  the  maximum 
possible  pure  graph  (standard  deviation  of  3),  which  has  all  15  indicators.  RobustBuild- 
PureClusters  got  a  much  smaller  average  of  2.85  (standard  deviation  of  2.41)6. 

In contrast, given data generated from pure models with 5 indicators per latent,
FindMeasurementPattern almost always gets the correct number of clusters (see experiments
in Silva et al., 2003). However, running RobustBuildPureClusters without
FindInitialSelection resulted in an average of 1.3 clusters that were split (in half) with a high
standard deviation of 1.03, indicating that it was not unlikely that in some runs of this
experiment we got 3 true clusters that were split in half. In only 25% of these 20 trials did
we get the correct number of clusters. Therefore, FindInitialSelection can be of great value.

⁵ In only one case, we got two indicators from one cluster grouped together with two indicators from
another true cluster in the same outcome cluster. This happened in both algorithms.

⁶ Notice that such deviations are high because of the small number of trials and variables.
Algorithm FindInitialSelection

Input: E, a sample covariance matrix of a set of variables O

1. Start with a complete graph C over O.

2. Remove edges of pairs that are marginally uncorrelated or uncorrelated conditioned on
a third variable.

3. C0 ← C.

4. Color every edge of C as BLUE.

5. For all edges N1 - N2 in C, if there is no other pair {N3, N4} such that all three tetrad
constraints hold in the covariance matrix of {N1, N2, N3, N4}, change the color of the
edge N1 - N2 to GRAY.

6. For all pairs of variables {N1, N2} linked by a BLUE edge in C:

If there exists a pair {N3, N4} that forms a BLUE clique with N1 in C, and a
pair {N5, N6} that forms a BLUE clique with N2 in C, all six nodes form a clique
in C0 and DisjointGroup(N1, N3, N4, N2, N5, N6; E) = true, then remove all edges
linking elements in {N1, N3, N4} to {N2, N5, N6}.

Otherwise, if there is no node N3 that forms a BLUE clique with {N1, N2} in
C, and no BLUE clique in {N4, N5, N6} such that all six nodes form a clique in C0
and DisjointGroup(N1, N2, N3, N4, N5, N6; E) = true, then change the color of the
edge N1 - N2 to YELLOW.

7. Remove all GRAY and YELLOW edges from C.

8. List_C ← FindMaximalCliques(C).

9. Let H be a graph where each node corresponds to an element of List_C and with no
edges. Let M_i denote both a node in H and the respective set of nodes in List_C.

10. Add an edge M1 - M2 to H only if there exists {N1, N2, N3} ⊂ M1 and {N4, N5, N6} ⊂
M2 such that DisjointGroup(N1, N2, N3, N4, N5, N6; E) = true.

11. H_choice ← ChooseClusteringClique(H).

12. Let H_clusters be the corresponding set of clusters, i.e., the set of sets of observed
variables, where each set in H_clusters corresponds to some M_i in H_choice.

13. Selection ← RobustPurify(H_clusters, C, E).

14. Return (Selection, C, C0).


Table  4:  Selects  an  initial  pure  model. 
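Step 8 of Table 4 calls FindMaximalCliques without fixing a particular procedure. As an illustration only, the sketch below enumerates maximal cliques with the Bron-Kerbosch algorithm with pivoting; this is a standard choice that we substitute here, not necessarily the one used in the original implementation. The function name `find_maximal_cliques` and the adjacency-dictionary representation are assumptions made for this example.

```python
def find_maximal_cliques(adj):
    """Enumerate the maximal cliques of an undirected graph.

    `adj` maps each node to the set of its neighbours (a node is never its
    own neighbour). Bron-Kerbosch with pivoting is used here as a stand-in
    for FindMaximalCliques; the result is a list of frozensets.
    """
    if not adj:
        return []
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: nodes already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the node with most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# A triangle a-b-c with a pendant edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cliques = find_maximal_cliques(graph)
```

For this small graph the maximal cliques are {a, b, c} and {c, d}; in the algorithm, each such clique would become one node M_i of the graph H built in Step 9.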


Notice that the order in which tests are applied might influence the outcome of
FindInitialSelection, since if we remove an edge X - Y in C at some point, then we are
excluding the possibility of using some tests where X and Y are required (e.g., when
searching to separate P and Q, we will not consider DisjointGroup(P, X, Y, Q, _, _)).
Imposing such a restriction reduces the overall computational cost and also reduces the
number of statistical tests that are performed. Consequently, the number of statistical mistakes
is also reduced. To minimize the ordering effect, an option is to run the algorithm multiple
times and select the output with the highest number of nodes. The more the true model
differs from a pure model, the more variety will be observed among different runs.
Purification also introduces variability: if two variables are linked to the same number of
impurities, we remove the first one according to the ordering given. In our experiments,
we actually do not avoid tests if the required BLUE cliques do not exist as proposed by
Step 6 of FindInitialSelection (with the exception of those that resulted from vanishing
correlations, since they introduce undesirable vanishing tetrads). This reduces the effect
of variability, but different choices of ordering of variables will in many cases still result in
different clusterings if the number of variables is high. That happens because the greedy
ChooseClustering/ChooseClusteringClique algorithms visit many states of equal
value during search, and in our implementation a choice is made based on which maximal
clique was generated first. Since the order in which cliques are generated is a function of a
random order of nodes in each run, we get variations of the result among runs.

For  instance,  in  our  simulation  studies  reported  in  the  next  section,  where  synthetic 
models  have  relatively  large  pure  submodels,  there  is  virtually  no  ordering  effect  in  the 
output. On the other hand, with the real-world cases, there is a clear variation of output
with  respect  to  the  chosen  order  of  variables.  However,  multiple  runs  can  actually  increase 
the  insight  given  by  pure  models,  as  illustrated  in  Section  7.2.  We  will  hardly  ever  have  a  pure 
model  with  all  variables,  but  by  showing  multiple  pure  models  over  different  sets  of  variables, 
one  can  still  have  a  clear  picture  of  the  generative  process.  Also,  in  the  future  we  might  want 
to  explore  the  effect  of  avoiding  tests  as  defined  by  Step  6  of  FindInitialSelection. 
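The multiple-run strategy discussed above (run the procedure under several random orderings and keep the output with the highest number of nodes) can be sketched as follows. This is a minimal illustration, not the original implementation: `run_once` is a hypothetical callable standing in for the whole clustering procedure, and breaking ties in favour of the earliest run is an arbitrary choice made for this example.

```python
import random


def best_of_random_orderings(variables, run_once, n_runs=10, seed=0):
    """Run a clustering procedure under several random variable orderings
    and keep the output that retains the most observed variables.

    `run_once` (hypothetical) takes an ordered list of variable names and
    returns a list of clusters, each cluster being a set of variable names.
    """
    rng = random.Random(seed)
    best, best_size = None, -1
    for _ in range(n_runs):
        ordering = list(variables)
        rng.shuffle(ordering)
        clusters = run_once(ordering)
        size = sum(len(c) for c in clusters)
        if size > best_size:  # ties keep the earliest run
            best, best_size = clusters, size
    return best
```

In the real-world analyses one may prefer to inspect all outputs rather than keep only the largest, since different pure models over different subsets of variables can be informative together.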

Finally, we define RobustPurify as in Table 5. After the first two steps, clusters do
not overlap and, according to our constraint rules, no two elements in different sets can share
a common parent in the true latent variable graph (BLUE or RED edge in C) or they cannot
be in a pure subgraph (GRAY edge in C).

Structural EM is applied as a heuristic for identifying impurities. Notice the use of
bidirected edges, which corresponds to freeing the correlation of the error terms of two
observed nodes, as an alternative to adding new independent latents. Adding a new latent would
require recomputing the required expected values and therefore wasting computational time.
We stress that in general the BIC score is not going to give the same result for different
graphs in the same tetrad equivalence class. The goal is to throw away indicators in the
purification, a much more modest goal than claiming that extra edges among indicators
can be identified. Therefore, we claim that heuristics for purification have a particular
practical use in this context. Although there is no theoretical guarantee that Structural
EM will converge to the global optimum of all DAGs, nor that greedy heuristic search with
a BIC score provides a consistent penalization for complexity, we find this heuristic to be
very useful in practice and consistent with some of the results in Elidan and Friedman
Algorithm RobustPurify

Inputs: Clusters, a set of subsets of some set O;

C, an undirected graph over O;

E, a sample covariance matrix of O.

1. Remove all nodes that appear in more than one set in Clusters.

2. For all pairs of nodes that belong to two different sets in Clusters and are adjacent
in C, remove from Clusters the one that belongs to the larger set, unless the smaller
set has fewer than three elements.

3. Let G be a graph with a latent corresponding to each nonempty set in Clusters. Add
all nodes in Clusters as observed nodes in G. For each set S ∈ Clusters, add a new
latent as the only common parent of all nodes in S. Choose an arbitrary ordering of
latents and according to that ordering create a fully connected DAG over the latents.

4. Apply Structural EM to (G, E) using the Gaussian BIC as a score function, and some
hill-climbing algorithm with operators as follows: adding a directed edge from an
observed node to another in the same cluster as long as it does not create cycles;
adding a bidirected edge between two observed nodes (in the same cluster or not) as
long as there is no directed path between these nodes; removing edges between observed
nodes.

5. Let Ord be a list of the observed nodes in G in decreasing order of the number of
non-latent adjacencies they have.

6. Sequentially remove elements from G according to the order given by Ord until no
observed node has an adjacency besides its unique latent parent.

7. Remove any latents without observed children.

8. Return G.


Table  5:  A  score-based  purification. 
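Steps 5 and 6 of Table 5 can be sketched as below. This is one reading of the procedure, under stated assumptions: the graph is reduced to a dictionary `impure_adj` of adjacencies among observed nodes (latent parents omitted), the ordering is computed once as in Step 5, and a node that has already lost all its impure adjacencies by the time it is reached is skipped rather than removed. That last choice is our assumption; the table does not spell it out.

```python
def purify_by_impurity_count(impure_adj):
    """Greedily remove observed nodes until no impure adjacency remains.

    `impure_adj` maps each observed node to the set of observed nodes it is
    adjacent to after the score-based search (directed or bidirected edges
    among indicators). Returns (kept_nodes, removed_nodes).
    """
    # Build a symmetric working copy so the caller's data is untouched.
    adj = {v: set() for v in impure_adj}
    for v, nbrs in impure_adj.items():
        for w in nbrs:
            adj.setdefault(w, set())
            adj[v].add(w)
            adj[w].add(v)

    # Step 5: fixed ordering by decreasing number of non-latent adjacencies
    # (sorted is stable, so ties follow the given order of the nodes).
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)

    removed = []
    # Step 6: remove nodes in that order until every remaining node is clean.
    for v in order:
        if not any(adj[u] for u in adj):
            break
        if not adj[v]:
            continue  # assumption: a node that became clean is kept
        for w in adj[v]:
            adj[w].discard(v)
        del adj[v]
        removed.append(v)
    return list(adj), removed


kept, removed = purify_by_impurity_count(
    {"x1": {"x2", "x3"}, "x2": {"x1"}, "x3": {"x1"}, "x4": set()}
)
```

In this toy input x1 carries two impure adjacencies, so removing it alone leaves x2, x3 and x4 pure.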
(2001), which illustrate that, given a starting point close to the true graph, heuristic
hill-climbing will provide an estimated graph reasonably close to the true graph. One can therefore
also interpret the tetrad constraint search that initializes the Structural EM module as a
principled approach to find a good starting point that is able to converge to a pure subgraph
of the original network. In the next section we evaluate how good this procedure is.

Finally,  we  apply  the  heuristic  of  removing  nodes  iteratively  according  to  the  number 
of  impurities  related  to  each  of  them.  Trying  to  achieve  some  kind  of  optimality  such  as 
maximizing  the  number  of  pure  nodes  or  requiring  at  least  k  indicators  per  latent  would  result 
in  a  very  expensive  combinatorial  optimization  problem.  For  instance,  even  the  problem  of 
finding  a  purification  of  a  given  graph  that  includes  the  maximum  number  of  latents  can  be 
shown  to  be  hard. 

Proposition  3  Let  G  be  a  latent  variable  graph.  Then,  finding  a  purified  subgraph  of  G 
with  the  maximum  number  of  latents  where  each  latent  has  at  least  one  indicator  is  NP-hard. 

Proof:  Reduction  to  MAX  CLIQUE.  Let  G'  be  equal  to  G  but  where  observed  nodes  that 
have  more  than  one  latent  parent  are  removed.  Create  a  graph  H  with  the  observed  nodes 
of  G'.  Add  an  edge  for  every  pair  of  nodes  that  do  not  share  a  common  latent  parent  and 
are d-separated in G' given the latents. Then finding a maximum clique in H is equivalent
to finding a pure subgraph of G with the maximum possible number of latents, each latent with
at  least  one  indicator.  □ 


7  Empirical  results 

Evaluating automated knowledge discovery algorithms is often a difficult task because of the
lack of a readily available gold standard by which comparisons could be made. This is
especially true for unsupervised learning techniques such as clustering and causality discovery.
However, we can still compare the outcome of our algorithm to theoretical models designed
by experts in a field of interest, although the models themselves might not be perfect.

Another approach we take to evaluate our algorithm is by sampling synthetic data from
simulated models. Knowing the true underlying structure, we can come up with
objective measures of success. Also, it is possible to perform sensitivity analysis of our model
with respect to distributional assumptions: in the next subsection, we will also evaluate how
sensitive the score-based purification is to non-Gaussian distributions. The second part of
our empirical evaluation concerns rebuilding the measurement model of three real-world data
sets according to theoretical models.

7.1  Synthetic  data 

The data sets we use in this section are synthetic. Synthetic data are valuable because
we know the true model that generated the given samples, and
therefore we can calculate precisely some measures of distance from our induced models to
the true structure. We will evaluate the following features for each pure model we get with
respect to a purified true graph:
Figure 4: In (a), a pure model with 2 latents and three indicators per latent. In (b), a type
of impure model with 2 latents where 2 observed variables per latent are children of
multiple latents. The model in (c) is an example of a model with three latents with a chain
that makes the first and last indicators of each latent impure.

• proportion of missing latents (ML), the number of latents in the true graph that
do not appear in the estimated pure graph, divided by the number of latents in the
true graph;

• proportion of missing indicators (MI), the number of indicators in the true purified
graph that do not appear in the estimated pure graph, divided by the number of
indicators in the true purified graph;

• proportion of misplaced indicators (Mpl), the number of indicators in the estimated
pure graph that end up in the wrong cluster, divided by the number of
indicators in the estimated pure graph;

• proportion of impurities (Im), the number of impurities in the estimated pure
graph divided by the number of impurities in the true (non-purified) graph;

• proportion of splits (Sp), the number of true clusters that were split into more than
one cluster in the estimated pure graph, divided by the total number of clusters in the
true graph.

To  perform  the  comparison,  we  should  indicate  which  latent  found  in  the  estimation 
corresponds  to  which  of  the  original  latents.  The  straightforward  way  is  making  the  match 
according  to  the  original  parent  of  the  majority  of  the  indicators  in  a  given  estimated  cluster: 
for example, suppose we have an estimated latent LE. If, for instance, 70% of the measures
in LE are measures of the true latent L2, we label LE as L2 in the estimated graph and
calculate the statistics of comparison as described above. Ties are broken arbitrarily.
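The majority-vote matching and three of the statistics defined above (ML, MI and Mpl) can be sketched as follows. This is an illustrative reading, not the evaluation code used for the experiments. It assumes that every estimated cluster is non-empty and that every estimated indicator also appears in the true purified graph; the function name and the dictionary-based representation are choices made for this example.

```python
from collections import Counter


def evaluate_clustering(true_clusters, est_clusters):
    """Match estimated clusters to true latents by majority vote and
    compute ML, MI and Mpl.

    true_clusters: dict mapping a latent name to its set of indicators in
                   the true purified graph.
    est_clusters:  list of non-empty sets of indicators (estimated graph).
    """
    parent = {x: latent for latent, inds in true_clusters.items() for x in inds}
    found, misplaced, n_est = set(), 0, 0
    for cluster in est_clusters:
        votes = Counter(parent[x] for x in cluster)
        # Label the cluster with the most frequent true parent; ties are
        # resolved by whichever label the counter returns first.
        label = votes.most_common(1)[0][0]
        found.add(label)
        misplaced += sum(1 for x in cluster if parent[x] != label)
        n_est += len(cluster)
    est_indicators = set().union(*est_clusters) if est_clusters else set()
    return {
        "ML": len(set(true_clusters) - found) / len(true_clusters),
        "MI": len(set(parent) - est_indicators) / len(parent),
        "Mpl": misplaced / n_est if n_est else 0.0,
    }


true_model = {"L1": {"a", "b", "c"}, "L2": {"d", "e", "f"}, "L3": {"g", "h", "i"}}
estimate = [{"a", "b", "d"}, {"e", "f"}]
scores = evaluate_clustering(true_model, estimate)
```

In this toy case L3 is missing (ML = 1/3), four of nine true indicators are absent (MI = 4/9), and d is the single misplaced indicator among five (Mpl = 1/5).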

For the following results, we generated only multivariate normal indicators, which requires
a linear latent structure. Samples were generated using the Tetrad IV program⁷. Values for
the coefficients are uniformly sampled from the interval [−1.5, −0.5] ∪ [0.5, 1.5]. Variances
for the exogenous nodes (i.e., latents without parents and error nodes) are uniformly sampled
from the interval [1, 3]. The motivation for choosing such intervals is generating artificial
models where the causal effects are not too big or too small. After the fully parameterized
model is set, independent samples are pseudorandomly generated. The pseudorandom number
generator used in the following experiments was the one used in the Java 1.4 virtual
machine. The p-value used in all tests for all experiments was 0.05 for FindInitialSelection
and reduced to 0.02 elsewhere, since in our simulations rules CS1, CS2, CS3 have a
tendency to fire erroneously because parts of the rules concerning tetrad constraints that
should not hold (e.g., $\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$ in CS1) are accepted as such when the null
hypothesis is actually true (i.e., $\sigma_{X_1X_2}\sigma_{Y_1Y_2} = \sigma_{X_1Y_2}\sigma_{X_2Y_1}$).
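The parameter sampling described above can be sketched in a few lines. The experiments themselves used Tetrad IV; the code below is only an illustrative stand-in written with Python's standard library. The chain structure among latents (L1 → L2 → ...) and all function names are assumptions made for this example, since the text does not fix the latent structure here.

```python
import random


def sample_coefficient(rng):
    """Uniform draw from [-1.5, -0.5] U [0.5, 1.5]."""
    magnitude = rng.uniform(0.5, 1.5)
    return magnitude if rng.random() < 0.5 else -magnitude


def sample_pure_model(n_latents, n_indicators, rng):
    """Random parameters for a pure linear model.

    Latents form a chain (an illustrative assumption); every latent has
    `n_indicators` children. Variances of exogenous terms are uniform on [1, 3].
    """
    return {
        "latent_coef": [sample_coefficient(rng) for _ in range(n_latents - 1)],
        "latent_var": [rng.uniform(1.0, 3.0) for _ in range(n_latents)],
        "loadings": [[sample_coefficient(rng) for _ in range(n_indicators)]
                     for _ in range(n_latents)],
        "error_var": [[rng.uniform(1.0, 3.0) for _ in range(n_indicators)]
                      for _ in range(n_latents)],
    }


def draw_sample(model, rng):
    """One observation: indicator values listed cluster by cluster."""
    row, previous = [], None
    for k, var in enumerate(model["latent_var"]):
        latent = rng.gauss(0.0, var ** 0.5)  # exogenous part of the latent
        if previous is not None:
            latent += model["latent_coef"][k - 1] * previous
        previous = latent
        for load, evar in zip(model["loadings"][k], model["error_var"][k]):
            row.append(load * latent + rng.gauss(0.0, evar ** 0.5))
    return row


rng = random.Random(1)
model = sample_pure_model(n_latents=3, n_indicators=3, rng=rng)
data = [draw_sample(model, rng) for _ in range(200)]
```

Keeping coefficient magnitudes between 0.5 and 1.5 reflects the stated motivation: effects that are neither negligible nor dominant relative to the error variances.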

We  generate  four  types  of  models:  pure  models  with  three  indicators  per  latent  (Pure-3) 
as  illustrated  by  Figure  4(a);  pure  models  with  five  indicators  per  latent  (Pure-5);  models 
with  three  pure  indicators  per  latent  plus  two  observed  variables  per  latent  that  are  shared 
indicators  (SI)  of  every  latent  (Pure-3  +  SI)  as  illustrated  by  Figure  4(b);  models  with  five 
indicators  per  latent,  three  of  which  are  pure,  and  the  other  two  are  linked  by  a  directed  edge 
(Pure-3  +  Chain).  Also,  the  last  indicator  of  each  cluster  is  a  parent  of  the  first  indicator 
of  the  consecutive  cluster,  as  illustrated  by  Figure  4(c).  In  this  way,  every  latent  will  have 
only  three  pure  indicators,  except  the  first  latent  in  the  chain,  which  will  have  four  pure 
indicators. 

Simulation results are given in Table 6. Each result is an average over 20 experiments
with different parameter values randomly selected for each instance and three different sample
sizes. There was a noticeable improvement from trials based on samples of size 1000 compared
to those with samples of size 200, but little difference was observed when comparing trials
of sample size 1000 to those with sample size 10000. There was a tendency to remove more
indicators than necessary in the purification procedure (i.e., high MI index). We conjecture
that it can be a result of using BIC as a score function: notice that this phenomenon
was less extreme with pure models. One can verify, at least empirically, that the Jacobian
matrix of the parameters of the network with respect to the joint parameters (i.e., the
matrix of derivatives of the entries of the covariance matrix with respect to coefficients and
error variances) has full rank when the variances of the latents are scaled to a fixed value.
According to Geiger et al. (1996), under this condition the BIC score might work well. A
way to improve our results might be through adjusting the BIC score by using this rank
instead of the number of parameters, but that might imply extra computational cost if it
is not possible to find an analytical way of computing such rank. An alternative is running
an iterated fit-and-purify procedure: after hill-climbing is done, remove only one variable,
then repeat the process from scratch. In this way, the purification is less sensitive to the numerous
edges that might have been added without necessity. However, the computational cost is
⁷ Available at http://www.phil.cmu.edu/tetrad.
Evaluation of estimated purified models

                       ML            MI            Mpl           Im            Sp
Pure-3
  sample size 200      0.16 ± 0.14   0.22 ± 0.08   0.05 ± 0.07   —             0.0 ± 0.0
  sample size 1000     0.04 ± 0.07   0.10 ± 0.07   0.0 ± 0.0     —             0.0 ± 0.0
  sample size 10000    0.04 ± 0.06   0.08 ± 0.08   0.0 ± 0.0     —             0.0 ± 0.0
Pure-5
  sample size 200      0.02 ± 0.05   0.25 ± 0.08   0.0 ± 0.0     —             0.06 ± 0.07
  sample size 1000     0.02 ± 0.07   0.15 ± 0.09   0.01 ± 0.02   —             0.02 ± 0.05
  sample size 10000    0.0 ± 0.0     0.03 ± 0.03   0.0 ± 0.0     —             0.0 ± 0.0
Pure-3 + SI
  sample size 200      0.09 ± 0.14   0.25 ± 0.11   0.02 ± 0.04   0.15 ± 0.10   0.01 ± 0.03
  sample size 1000     0.05 ± 0.07   0.14 ± 0.11   0.0 ± 0.0     0.12 ± 0.07   0.0 ± 0.0
  sample size 10000    0.07 ± 0.09   0.13 ± 0.10   0.0 ± 0.0     0.08 ± 0.08   0.0 ± 0.0
Pure-3 + Chain
  sample size 200      0.14 ± 0.13   0.28 ± 0.11   0.02 ± 0.04   0.21 ± 0.12   0.02 ± 0.05
  sample size 1000     0.02 ± 0.05   0.11 ± 0.06   0.0 ± 0.0     0.04 ± 0.06   0.02 ± 0.05
  sample size 10000    0.04 ± 0.08   0.12 ± 0.12   0.0 ± 0.0     0.05 ± 0.05   0.02 ± 0.05

Table  6:  Results  obtained  for  estimated  purified  graphs.  Each  number  is  an  average  over  20 
trials,  with  an  indication  of  the  standard  deviation  over  these  trials. 


also greatly increased. In the future, we may try to adopt similar strategies.

We also ran experiments to detect how sensitive RobustBuildPureClusters might
be when the normality assumption is violated. Using the same causal structure from the
Pure-3 + SI graph and multivariate Gaussian parameterization, we generated Gaussian data
with additional random noise sampled from a mixture of two betas independently added to
each observed variable. The mixture was defined randomly for each data set by sampling
the four beta parameters from a uniform [0, 10] distribution, and the mixture proportion
from a uniform [0, 1]. We also multiplied the noise by 3, and generated 15 data sets where
the average proportion of variance for each variable increased by at least 30% after adding
noise.
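The noise model just described can be sketched as follows, again using only Python's standard library rather than the software used for the experiments. The function names are hypothetical, and the redraw of an exact zero is a guard we add because `random.betavariate` requires strictly positive parameters. The selection of data sets whose variance increased by at least 30% is not reproduced here.

```python
import random


def make_beta_mixture_noise(rng, scale=3.0):
    """Draw a random two-component beta mixture and return a noise sampler.

    The four beta parameters are uniform on [0, 10] and the mixture
    proportion is uniform on [0, 1]; the noise is multiplied by `scale`.
    """
    params = []
    while len(params) < 4:
        p = rng.uniform(0.0, 10.0)
        if p > 0.0:  # guard: betavariate needs strictly positive parameters
            params.append(p)
    a1, b1, a2, b2 = params
    weight = rng.uniform(0.0, 1.0)

    def noise():
        if rng.random() < weight:
            return scale * rng.betavariate(a1, b1)
        return scale * rng.betavariate(a2, b2)

    return noise


def add_noise(rows, rng):
    """Add independent mixture noise to every entry of `rows`.

    One mixture is drawn per data set, as in the description above.
    """
    noise = make_beta_mixture_noise(rng)
    return [[x + noise() for x in row] for row in rows]


rng = random.Random(0)
clean = [[0.0] * 4 for _ in range(50)]
noisy = add_noise(clean, rng)
```

Because a beta variable lies in [0, 1], each added term lies in [0, 3]; the resulting marginals are bounded-noise perturbations of the Gaussian data and can be skewed or bimodal depending on the sampled parameters.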

The results were as follows for a sample size of 1000: an average of 0.07 missing latents
(standard deviation of 0.09), 0.24 missing indicators (0.10), 0.02 misplaced indicators
(0.03), 0.07 impurities (0.07) and 0.009 clusters that were split (for only one cluster in one
of the 15 trials). There was a significant increase of missing indicators compared to the case
with no non-Gaussian noise, but the algorithm still demonstrated a robust behavior against
deviations from normality according to the other criteria. This is not surprising, considering
the relative robustness of linear models against wrong distributional assumptions. However,
a more extensive sensitivity analysis still needs to be done in the future.
7.2 Real-world applications

We now discuss results obtained on three different data sets in the social sciences. Data
collected from social questionnaires may pose significant problems for exploratory data
analysis, since samples are usually small and noisy. Nevertheless, they have a very useful
property for our empirical evaluation purposes: questionnaires are designed to target specific
latent factors (such as “stress”, “job satisfaction”, and so on) and a theoretical measurement
model is developed by experts in the area to measure the desired latent variables, thus
providing a basis for comparison with the output of our algorithm. Such variables usually
include dozens of different indicators, although the chance that various observed variables
are not pure measures of their theoretical latents is high. Indicators are usually discrete, but
ordered in a Likert scale (Bollen, 1989) such as {“strongly disagree”, “relatively disagree”,
“indifferent”, “relatively agree”, “strongly agree”}. We will treat them as continuous
variables.

Since there are theoretical models, it is easier to evaluate how our algorithm performs.
The evaluation performed in the following three data sets will basically contrast the
qualitative models obtained from our tetrad analysis against the theoretical models specified by
previous empirical research. As an additional comment, since sample sizes are small, such
data sets could hardly be reliably analysed by full score-based hill-climbing algorithms, since
the number of parameters would by far exceed the number of data points. When our
procedure invokes the score-based purification, the number of parameters is already dramatically
reduced.

Student anxiety factors. A survey of test anxiety indicators was administered to 335 grade
12 male students in British Columbia (Bartholomew et al., 2002). The survey consisted of
20 measures on symptoms of anxiety under test conditions. A brief description of the 20
indicators is shown in Table 7.

Using factor analysis and a scree plot, Bartholomew et al. concluded that two factors would
be the best choice for this data set. If we perform a chi-square test of statistical
fit using the given covariance matrix, the factor analysis implementation in SAS reveals
that just one factor is enough with a p-value of 0.09. This is also the result that minimizes
BIC. Bartholomew et al. favor a better account of the variation in this data by using a more
complex model.

According to Bartholomew et al., this inventory has been used in many countries with
similar results. The original study identified items {x2, x8, x9, x10, x15, x16, x18} as indicators
of an “emotionality” latent factor (this includes physiological symptoms such as jitteriness and
a faster heartbeat), and items {x3, x4, x5, x6, x7, x14, x17, x20} as indicators of a more
psychological type of anxiety labeled “worry” by Bartholomew et al. No further description is
given about the remaining five variables. Bartholomew et al.’s factor analysis with oblique
rotation roughly matches this model.

We ran our algorithm 10 times with different random orderings of variables and we
always got the same measurement model (xi represents the ith item in Table 7):

1. x2, x8, x9, x10, x15, x16, x18

2. x3, x5, x7
1. Lack of confidence during tests

2.  Uneasy,  upset  feeling 

3.  Thinking  about  grades 

4.  Freeze  up 

5.  Thinking  about  getting  through  school 

6.  The  harder  I  work,  the  more  confused  I  get 

7.  Thought  interfere  with  concentration 

8.  Jittery  when  taking  tests 

9.  Even  when  prepared,  get  nervous 

10.  Uneasy  before  getting  the  test  back 

11.  Tense  during  test 

12.  Exams  bother  me 

13.  Tense/stomach  upset 

14.  Defeat  myself  during  tests 

15.  Panicky  during  tests 

16.  Worry  before  important  tests 

17.  Think  about  failing 

18.  Heart  beating  fast  during  tests 

19.  Can’t  stop  worrying 

20.  Nervous  during  test,  forget  facts 

Table  7:  Indicators  of  test  anxiety  described  in  Bartholomew  et  al.  (2002). 
3. x6, x14

Interestingly, the largest cluster closely corresponds to the “emotionality” factor as
described by previous studies. The remaining two clusters are a split of “worry” into two
subclusters with some of the original variables eliminated. Variables in the second cluster
are only questions that explicitly describe “thinking” about success/failure (the only other
question in the survey with the same characteristic was x17, which was eliminated). Variables
x6 and x14 can be interpreted as indicating self-defeat.

To evaluate how the model given by Bartholomew et al. compares to the outcome of
our algorithm, we will compare their fits according to the usual chi-square test, and also
evaluate intermediate models. The two-factor model given by all theoretical indicators of
“emotionality” and “worry” does not fit as a pure model (p-value of zero): the full factor
analysis solution will require that some of the indicators have significant loadings in both
latents, but there is no simple principled way to explain why such loadings are necessary.
They may be due to direct effects of one variable on another, or due to other latent factors
independent of the two conjectured. Besides that, the significance of such coefficients is tied
to whatever ad-hoc rotation method is employed in order to obtain “simple structure”.

If we remove variables x4, x17 and x20 from Bartholomew et al.’s model because they are
not in our purified model and fit a 2-factor purified model (i.e., equivalent to our model
after merging clusters 2 and 3, where latents are always fully connected), we get a p-value of
0.11, corresponding to a chi-square statistic of 65.8 (53 degrees of freedom). This model
itself might be significant, but comparing to our proposed model of p-value 0.47 (chi-square
of 51.2, 51 degrees of freedom), the difference of chi-squares is large enough such that the
p-value of the pure two-factor model, using as alternative hypothesis our model, drops to
0.0007. This strongly suggests that our model adds a significant improvement in fit to the
pure two-factor model by splitting the group {x3, x5, x6, x7, x14} into two. In contrast, by
randomly partitioning the first cluster into two, we did not get any significant improvement
(p-value < 0.05) in 5 trials. To summarize, by dropping only 3 out of 15 previously
classified variables (among a total of 20 variables), our algorithm built a measurement model
not only much simpler to understand, but also giving a better fit. All without using any
domain-specific prior knowledge and without relying on ad-hoc definitions of “simplicity”
such as the ones used to justify factor rotation.
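The comparison between the pure two-factor model and the three-cluster model can be checked with a short calculation. The sketch below assumes the standard chi-square difference test for nested models, which matches the reported numbers; the helper `chi2_sf_2df` is a name introduced here and uses the closed form of the chi-square upper tail that holds only for 2 degrees of freedom.

```python
import math


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of
    freedom; in this special case the closed form is exp(-x / 2)."""
    return math.exp(-x / 2.0)


# Fit statistics reported in the text.
chi2_two_factor, df_two_factor = 65.8, 53  # pure two-factor model
chi2_three, df_three = 51.2, 51            # model found by the algorithm

diff = chi2_two_factor - chi2_three        # difference of chi-squares
df_diff = df_two_factor - df_three         # difference in degrees of freedom
p_value = chi2_sf_2df(diff)

print(round(diff, 1), df_diff, round(p_value, 4))  # 14.6 2 0.0007
```

The difference of 14.6 on 2 degrees of freedom gives a p-value of about 0.0007, in agreement with the value quoted above.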

Well-being and spiritual coping. Bongjae Lee from the University of Pittsburgh organized
a study to investigate religious/spiritual coping and stress in graduate students. In December
of 2003, 127 Master of Social Work students answered a questionnaire intended to measure
three main factors:

• stress, measured with 21 items, each using a 7-point scale (from “not at all stressful” to
“extremely stressful”) according to situations such as: “fulfilling responsibilities both
at home and at school”; “meeting with faculty”; “writing papers”; “paying monthly
expenses”; “fear of failing”; “arranging childcare”;

• well-being, measured with 20 items, each using a 4-point scale (from “rarely or none”
to “most or all the time”) according to indicators such as: “my appetite was poor”; “I felt
fearful”; “I enjoyed life”; “I felt that people disliked me”; “my sleep was restless”;

• religious/spiritual coping, measured with 20 items, each using a 4-point scale (from
“not at all” to “a great deal”) according to indicators such as: “I think about how
my life is part of a larger spiritual force”; “I look to God (high power) for strength in
crises”; “I wonder whether God (high power) really exists”; “I pray to get my mind off
of my problems”.

The full questionnaire is given in the Appendix. Theoretical latents are not necessarily
unidimensional, i.e., they might be partitioned into an unknown set of sublatents and their
indicators might be impure, but there was no prior knowledge about which impurities might
exist.

The goal of the original study was to use graphical models to quantify how spiritual
coping moderates the association of stress and well-being. Our goal in this analysis is to
verify whether we get a clustering consistent with the theoretical measurement model (i.e., questions
related to different topics will not end up in the same cluster), and to analyse how questions are
partitioned within each theoretical cluster (i.e., how a group of questions related to the same
theoretical latent ended up divided into different subclusters) using no prior knowledge.

The  algorithm  was  applied  10  times  with  a  different  random  choice  of  variable  ordering 
each  time.  On  average  we  got  18.2  indicators  (standard  deviation  of  1.8).  Clusters  with  only 
one  variable  were  excluded.  On  average,  5.5  latents  were  discovered  (standard  deviation 
of  0.85).  Counting  only  latents  with  at  least  three  indicators,  we  had  on  average  4  latents 
(standard deviation of 0.67). In comparison, using the theoretical model as an initial model
and applying purification directly (footnote 8), i.e., without automated clustering, we obtained 15
variables (8 indicators of stress, 4 indicators of coping and 3 indicators of depression). We
should not expect to do much better with an automated clustering method. This clustering
is given below:

1. Clustering C0 (p-value: 0.28):

STR03,  STR04,  STR16,  STR18,  STR20 
DEP09,  DEP13,  DEP19 
COP09,  COP12,  COP14,  COP15 

By comparing each result to the theoretical model and taking the proportion of indicators
that were clustered differently from the theoretical model, we had an average proportion
of 0.05 (standard deviation of 0.05). The proportionally high standard deviation is a consequence
of the small proportions: in 4 out of 10 cases there was no indicator mistakenly
clustered with respect to the questionnaire, in 5 out of 10 we had only one mistake, and in
only one case there were two mistakes.

The  three  outputs  with  the  highest  number  of  indicators  (respectively,  21,  20,  20)  were 
also  the  ones  with  the  highest  number  of  latents: 

Footnote 8: In order to save time, we first applied a constraint-based purification method described in Spirtes et al.
(2000), using false discovery rates to control for multiple hypothesis tests.
Due to the relatively large number of variables, this method is quite conservative and will tend to underprune
the model, and therefore should not compromise the subsequent score-based purification that was applied.
For instance, after the first step the model still had a p-value of zero according to a chi-square test.

1. Clustering C1 (p-value: 0.31)

STR05,  STR06,  STR08,  STR09 
STR12,  STR15,  STR21 
DEP06,  DEP08,  DEP17,  DEP18,  DEP20 
DEP15,  DEP19 

COP03,  COP04,  COP05,  COP11,  COP16 
COP10, COP13

2.  Clustering  C2  (p-value:  0.80) 

STR06,  STR09,  STR10 
STR07,  STR15,  STR21 
DEP08,  DEP12 
DEP01,  DEP07,  COP06 
COP02,  COP03,  COP04,  COP11 
COP15,  COP16,  COP18 
STR17,  DEP36 

3.  Clustering  C3  (p-value:  0.52) 

STR05,  STR08,  STR09,  STR10 
STR12,  STR21 

DEP06,  DEP10,  DEP17,  DEP18,  DEP20 
DEP08,  DEP12,  DEP16 
COP03, COP05, COP11, COP18
COP10, COP13
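
The proportion of indicators "clustered differently from the theoretical model" can be computed in more than one way, and the text does not spell out the exact counting rule. The sketch below is one plausible reading, not necessarily the authors' procedure: each indicator's theoretical latent is read off its prefix (STR, DEP, COP), each discovered cluster is assigned its majority label, and every indicator that disagrees with the majority counts as one mistake. The function names are hypothetical.

```python
from collections import Counter


def theoretical_label(indicator):
    """Theoretical latent of an indicator, read off its prefix (STR, DEP or COP)."""
    return indicator[:3]


def misclustered_proportion(clustering):
    """Share of indicators that disagree with the majority label of their cluster.

    Assumption: a cluster's label is its most frequent prefix; in a tie,
    all but one of the tied indicators are counted as mistakes.
    """
    total = 0
    mistakes = 0
    for cluster in clustering:
        label_counts = Counter(theoretical_label(name) for name in cluster)
        majority_size = label_counts.most_common(1)[0][1]
        total += len(cluster)
        mistakes += len(cluster) - majority_size
    return mistakes / total


# Clusterings C1 and C2 as listed in this section.
C1 = [
    ["STR05", "STR06", "STR08", "STR09"],
    ["STR12", "STR15", "STR21"],
    ["DEP06", "DEP08", "DEP17", "DEP18", "DEP20"],
    ["DEP15", "DEP19"],
    ["COP03", "COP04", "COP05", "COP11", "COP16"],
    ["COP10", "COP13"],
]
C2 = [
    ["STR06", "STR09", "STR10"],
    ["STR07", "STR15", "STR21"],
    ["DEP08", "DEP12"],
    ["DEP01", "DEP07", "COP06"],
    ["COP02", "COP03", "COP04", "COP11"],
    ["COP15", "COP16", "COP18"],
    ["STR17", "DEP36"],
]

print(misclustered_proportion(C1))
print(misclustered_proportion(C2))
```

Under this counting rule C1 contains no mistakes, while C2 contains two among its 20 indicators (COP06 in a depression cluster, and one of the mixed pair STR17, DEP36).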

P-values are obtained from a chi-square test assuming a multivariate Gaussian distribution.
Notice that variables COP11 and COP16 are clustered together in C1, while they are
separated in C2. This is due to the first stage of clustering used in our
implementation, where we look for clusters of size at least three based on a more stringent
version of CS1. In the case of C1, we obtained a clustering in the first stage where COP11
and COP16 were in the same cluster and, therefore, not tested again in the second stage.
In the C2 run, the first stage did not include this cluster, and during the second stage there
was a condition by which COP11 and COP16 were separated. In principle, the
purification method should remove one of these two indicators in C1 if they were not meant
to be clustered together, and no rule should separate COP11 and COP16 in C2 if they were
not meant to be separated, but with small sample sizes there is no guarantee of a reliable choice.
This is also a reason why it is useful to report different 1-interpretations. A similar situation
happened between COP11 and COP18.

In order to evaluate how the split of theoretical clusters into subclusters was helpful, we
evaluated the fit of models C1, C2 and C3 by merging subclusters of the same theoretical
concept into single ones, one at a time. For C1, all three submodels have p-values less than
0.03. For C2, we first removed indicators STR17, DEP36 and COP06 to remove the effect of
having a theoretically wrong clustering. The resulting p-value is roughly the same, 0.79. We
then merged the stress, depression and coping pairs of clusters, one pair at a time. Merging
the depression indicators results in a model with a p-value of 0.09, and the difference of chi-squares
between the original model and the merged indicators model is not significant at a level
$10^{-6}$, favoring the more complex model. Merging the stress indicators results in a model
with a p-value of 0.21, and the difference of chi-square statistics has a p-value not significant
at a level $10^{-2}$. Merging the coping indicators results in a model with a p-value of 0.73, and
the difference in chi-squares now has a p-value of 0.22, providing evidence that this cluster
might have been spuriously divided.
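
The comparison between a discovered model and its merged counterpart is a chi-square difference test for nested models. The sketch below shows the arithmetic under stated assumptions. The chi-square statistics and degrees of freedom are hypothetical placeholders, since the section reports only p-values. The upper-tail probability is computed with the standard recurrence for integer degrees of freedom, so that only the Python standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for the two base cases: df = 2 and df = 1.
    if df % 2 == 0:
        tail, k = math.exp(-half), 2
    else:
        tail, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    while k < df:
        tail += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return tail


def chi2_difference_pvalue(chi2_merged, df_merged, chi2_full, df_full):
    """P-value of the chi-square difference between a merged (restricted) model
    and the more complex model it is nested in."""
    if df_merged <= df_full:
        raise ValueError("the merged model must have more degrees of freedom")
    return chi2_sf(chi2_merged - chi2_full, df_merged - df_full)


# Hypothetical statistics, for illustration only (not taken from the study).
print(chi2_difference_pvalue(chi2_merged=75.3, df_merged=52, chi2_full=60.1, df_full=51))
```

A small p-value indicates that merging the subclusters worsens the fit significantly, which favors the more complex model. A large p-value suggests the split may be spurious.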

When looking at the descriptions of items {COP02, COP03, COP04, COP11} there is
actually a significant degree of semantic cohesion: they are all items concerning “fighting
difficult situations”. Items in cluster {COP15, COP16, COP18} are not as clearly grouped,
but one can still argue that among all items given in the questionnaire they are the ones
more directly related to “possible sources of advice” in a more general sense. Interestingly,
the former cluster can then be seen as a special case of the latter. In any case, the fact that in
C1 we had COP11 and COP16 clustered together, and in C3 we had COP11 and COP18
together, provides extra evidence that these clusters might have been better interpreted when
merged.

Concerning  merging  clusters  of  C3,  when  we  merge  the  stress  clusters  the  resulting  model 
has  a  p-value  of  0.006,  and  the  difference  of  chi-squares  highly  favours  the  more  complex 
model.  When  the  depression  clusters  are  merged,  the  new  p-value  is  0.004,  and  again  the 
more  complex  model  is  favoured.  Finally,  when  the  two  clusters  for  coping  are  merged,  the 
p-value  is  0.21,  but  the  difference  of  chi-squares  implies  a  p-value  of  only  0.002,  which  still 
indicates  lack  of  evidence  supporting  the  less  complex  model  compared  to  the  one  found  by 
our  procedure. 

In conclusion, by analysing the models obtained from the automated latent discovery
procedure, one can verify that they largely match theoretical expectations and, more than
that, are slightly more comprehensive than the purification C0 obtained by using the original
questionnaire as a starting point. C1, C2 and C3 also maintain excellent indices of fit,
despite their larger complexity with respect to C0.

Single-mothers’  self-efficacy  and  children’s  development:  Jackson  and  Scheines  (2004) 
analysed  a  longitudinal  study  on  single  black  mothers  with  one  child  in  New  York  City  from 
1996 to 1999. The goal of the study was to detect the relationship among perceived self-efficacy,
mothers’ employment, maternal parenting and child outcomes. Overall, there were
nine  factors  used  in  this  study.  Three  of  them,  age,  education  and  income,  are  represented 
directly  by  one  indicator  each  (here  represented  as  W2moage,  W2moedu  and  W2faminc, 
respectively).  The  other  six  factors  are  latent  variables  measured  by  a  varied  number  of 
indicators: 

1. financial strain (3 indicators, represented by W2finan1, W2finan2, W2finan3)

2. parenting stress (26 indicators, represented by W2paroa - W2paroz)

3. emotional support from family (20 indicators, represented by W2suf01 - W2suf20)

4. emotional support from friends (20 indicators, W2sufr01 - W2sufr20)

5. tangible support (i.e., more material than psychological; 4 indicators, W2ssupta -
W2ssuptd)

6. problem behaviors of child (30 indicators, W2mneg1 - W2mneg30)

We  do  not  reproduce  the  original  questionnaire  here  due  to  its  size.  The  questionnaire  is 
based  on  previous  work  on  creating  scales  for  such  latents.  As  before,  we  evaluate  how  our 
algorithm output compares to the theoretical model. The extra difficulty here is that the
distributions of the variables, which are ordinal categorical, are significantly skewed. Some
of the categories are very rare, and we smoothed the original levels by collapsing values
that were adjacent and represented less than 5% of the total number of cases. Several
variables ended up binary after this transformation, which reduces the efficiency of
models based on multivariate Gaussian distributions. One of the 106 variables was also
removed (W2sufr04) since 98% of the points fell into one of the two possible categories. The
sample size is 178, relatively large for this kind of study, but it is still considerably small for
exploratory data analysis.
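
The collapsing of rare ordinal levels can be sketched as follows. The text only states that adjacent values representing less than 5% of the cases were merged. The greedy rule used here (repeatedly merge the smallest group into its smaller neighbour) and the function name are assumptions made for illustration, not the authors' exact procedure.

```python
from collections import Counter


def collapse_rare_levels(values, threshold=0.05):
    """Merge adjacent ordinal levels until every group holds at least
    `threshold` of the cases (or only one group is left).

    Returns a dict mapping each original level to its new group index.
    """
    counts = Counter(values)
    levels = sorted(counts)
    groups = [[level] for level in levels]
    sizes = [counts[level] for level in levels]
    n = len(values)
    while len(groups) > 1:
        i = min(range(len(groups)), key=lambda k: sizes[k])
        if sizes[i] >= threshold * n:
            break  # the smallest group is large enough, so all groups are
        # Assumed rule: merge into the smaller of the adjacent groups.
        if i == 0:
            j = 1
        elif i == len(groups) - 1:
            j = i - 1
        else:
            j = i - 1 if sizes[i - 1] <= sizes[i + 1] else i + 1
        lo, hi = min(i, j), max(i, j)
        groups[lo] = groups[lo] + groups[hi]
        sizes[lo] += sizes[hi]
        del groups[hi]
        del sizes[hi]
    return {level: g for g, group in enumerate(groups) for level in group}


# A skewed three-level item: level 1 covers only 2% of the cases.
item = [1] * 2 + [2] * 50 + [3] * 48
print(collapse_rare_levels(item))
```

In this example the rare level 1 is merged with level 2 and the item becomes binary. If a variable collapses into a single group, it carries almost no information, which is the situation that led to the removal of W2sufr04.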

As  before,  the  algorithm  was  applied  10  times  with  a  different  random  choice  of  variable 
ordering  each  time.  On  average  we  got  21  indicators  (standard  deviation  of  3.35)  excluding 
clusters  with  only  one  variable.  On  average,  7.3  latents  were  discovered  (standard  deviation 
of  1.5).  Counting  only  latents  with  at  least  three  indicators,  we  had  on  average  4.3  latents 
(standard  deviation  of  0.86).  Moreover,  comparing  each  result  to  the  theoretical  model 
and taking the proportion of indicators that were wrongly clustered, we had an average
proportion of 0.08, with standard deviation of 0.07.

It was noticeable that the small theoretical clusterings (“financial strain” and “tangible
support”) did not show up in the final models, but we claim that errors of omission
are less harmful than those of commission, i.e., wrong clustering. However, it was relatively
unexpected that the clusterings obtained in the first stage of our implementation (i.e., the
output of FindInitialSelection) were larger in number of indicators than the ones obtained
at the end of the process. This can be explained by the fact that the initial step is a
more constrained search, and therefore less prone to overfit. Since our data set is noisier
than in the previous cases, we chose to evaluate only the three largest clusters obtained
from FindInitialSelection. In this case, we had an average proportion of 0.037 wrongly
clustered items (standard deviation: 0.025), 4.9 clusters (deviation: 0.33), 4.6 clusters of size
at least three (deviation: 0.71) and 24.2 indicators (deviation: 2.8). Notice that the clusters
were less fragmented than in the previous case, i.e., we had fewer clusters, more indicators per
clustering, and an insignificant number of clusters with fewer than three indicators.

The largest clusters in this situation were the following:

1. Cluster D1 (p-value: 0.46):

W2sufr02 W2sufr05 W2sufr08 W2sufr13 W2sufr14 W2sufr19 W2sufr20

W2mneg14 W2mneg15 W2mneg2 W2mneg22 W2mneg26 W2mneg28 W2mneg29

W2suf01 W2suf05 W2suf08

W2paro2e W2paro2j W2paro2t W2paro2w

W2suf07 W2suf12 W2suf17

2. Cluster D2 (p-value: 0.22):

W2sufr01 W2sufr08 W2sufr10 W2sufr12 W2sufr13 W2sufr14 W2sufr19 W2sufr20

W2suf04 W2suf05 W2suf10

W2paro2e W2paro2j W2paro2t W2paro2w

W2paro2k W2suf12 W2suf17

W2mneg2 W2mneg5 W2mneg12 W2mneg14 W2mneg21 W2mneg22 W2mneg26

3. Cluster D3 (p-value: 0.29):

W2mneg2 W2mneg10 W2mneg22 W2mneg26 W2mneg28 W2mneg29

W2sufr01 W2sufr05 W2sufr08 W2sufr09 W2sufr12 W2sufr13 W2sufr14 W2sufr19

W2suf02 W2suf04 W2suf05 W2suf11 W2suf13 W2suf20

W2paro2e W2paro2j W2paro2t W2paro2w

W2paro2k W2suf12 W2suf17

One  can  see  that  such  models  largely  agree  with  those  formed  from  prior  knowledge. 
However, success in this domain is not as interesting as in the previous two cases: unlike in
the test anxiety and spiritual coping models, most entries of the covariance matrix of the
latent variables are very small, resulting in a considerably easier clustering by
just observing marginal independencies among items.

Still,  the  cases  where  theoretical  clusters  were  split  seem  to  be  in  accordance  with  the 
data: merging the W2suf indicators in a single pure cluster in D1 will result in a model with
a p-value of 0.008. Merging the W2suf variables in D2 will also result in a low p-value (0.06)
even when W2paro2k is removed. Unsurprisingly, doing a similar merging in D3 gives a
model with a p-value of 0.04. This is a strong indication that W2suf12 and W2suf17 should
form  a  cluster  on  their  own.  In  fact,  these  two  items  are  formulated  as  two  very  similar 
indicators:  “members  of  my  family  come  to  me  for  emotional  support”  and  “members  of  my 
family  seek  me  out  for  companionship”.  No  other  indicator  for  this  latent  seems  to  fall  in  the 
same  category.  Why  this  particular  pair  is  singled  out  in  comparison  with  other  indicators 
for  this  latent  is  a  question  for  future  studies  and  a  simple  example  of  how  our  procedure 
can  help  in  understanding  the  latent  structure  of  the  data. 

8  Discussion  and  future  work 

We  introduced  a  novel  method  for  automated  knowledge  discovery  based  on  causal  graphs 
with  latent  variables.  The  very  general,  relatively  weak,  assumptions  by  which  this  method 
has  theoretical  guarantees  are  made  explicit.  Although  there  are  situations  where  the  output 
of  our  algorithm  might  not  be  very  informative,  since  one  can  expect  that  only  a  subset  of  the 
available  variables  forms  a  pure  measurement  model,  this  can  also  be  seen  as  a  strength  of  the 
algorithm:  it  does  not  commit  itself  to  report  features  of  the  underlying  causal  model  that 
could  be  explained  by  different  mechanisms  under  the  given  set  of  assumptions.  Assumptions 
are  made  clear  instead  of  being  buried  in  apparent  but  deceiving  flexibility. 

Our experiments presented evidence that such a framework can be useful in practice, but
as usual there are many directions where this work can be expanded:

• dependency on parametric assumptions: the tetrad equivalence class and nearly all
of our causal assumptions are independent of assumptions about the probability distribution
of the data. However, when it comes to performing tetrad constraint tests or
scoring a measurement model for purification, probabilistic descriptions of the data are
crucial. So far we have restricted ourselves to multivariate Gaussian distributions, as
usual in the literature of graphical models with continuous variables. In principle, there
are asymptotic distribution-free tests of tetrad constraints (Bollen, 1990) and linear
measurement  errors  are  known  to  be  relatively  robust  to  the  failure  of  the  normality 
assumption  (Fuller,  1987).  However,  there  might  be  more  statistically  efficient  ways  of 
weakening  distributional  assumptions.  This  is  also  a  problem  for  scoring  DAGs  as  used 
for  a  heuristic  purification.  More  flexible  approaches  for  measurement  models  such  as 
Carroll  et  al.  (1996)  could  be  explored  in  the  context  of  discovering  measurement 
model  structure; 

• finding robust score functions that will give the same score only for models in the same
tetrad equivalence class. The goal is to avoid constraint-satisfaction approaches for
learning graphical models and reduce the problem to hill-climbing algorithms. However,
this can be a difficult task for a variety of reasons, such as the fact that multivariate
Gaussian latent variable models are not curved exponential models and even
approximations for them can be potentially very difficult to compute (Rusakov and
Geiger, 2004). Also, just having a score equivalence class corresponding to a tetrad
equivalence class is not enough to guarantee a theoretically consistent learning procedure:
one would also need to prove that some non-trivial search algorithm is able to
find the best scoring model;

•  better  treatment  of  discrete  variables:  although  we  hinted  how  discrete  variables  could 
be  integrated  in  a  tetrad  equivalence  class,  we  did  not  run  any  experiments  to  evaluate 
how  this  approach  performs.  Bartholomew  and  Knott  (1999)  survey  different  ways  of 
integrating  factor  analysis  and  discrete  variables  that  can  be  readily  adapted.  Two 
major problems affect discrete factor analysis: the reliance on underlying Gaussian random
variables, which ties the structural causal assumptions to a specific probabilistic model; and
the computational cost of performing numerical integrations. So far no empirical studies
have been performed on how such issues might affect the tetrad equivalence class
described here;

•  study  applications  of  this  technique  for  multivariate  density  estimation.  Since  density 
estimation in high dimensional spaces is a very difficult task, one could try a more
modest goal of choosing variables that can be represented as a pure measurement
model and then fitting such a model to the data. For instance, Zhang (2004) noticed that
it  is  not  always  possible  to  find  good  fitting  models  for  his  class  of  pure  measurement 
models.  We  therefore  would  search  for  a  subset  of  variables  that  would  be  reasonably 
represented  in  our  pure  measurement  model  formulation; 

• finding causal relationships among latent variables given a fixed measurement model for
them. This was studied before in Silva (2002) with a different clustering algorithm. The
natural extension is applying similar techniques with the learning algorithm developed
in this work. One can then contrast the full latent variable approach against, e.g., the
standard practice in social sciences of building scales, where new variables are created
as deterministic functions of indicators (average, for instance) and graphical models
are built using these new variables instead of introducing latents.

Appendix
A  Proofs 

Before presenting proofs for the lemmas and theorems stated in the body of this text, we will
introduce the following notation. Let $\sigma_{XY}$ denote the covariance of any two random variables
$X$ and $Y$, and $\rho_{XY.Z}$ denote the partial correlation of $X$ and $Y$ given $Z$. The symbol $\{X_t\}$
will stand for a finitely indexed set of variables.

Also, let $X = \lambda_{x0} L + \sum_{i=1}^{k} \lambda_{xi}\eta_i$ and $Y$ be random variables with zero mean, as well as
$\{L, \eta_1, \dots, \eta_k\}$. Let $\{\lambda_{x0}, \lambda_{x1}, \dots, \lambda_{xk}\}$ be real coefficients. We define $\sigma_{XYL}$, the “covariance of
$X$ and $Y$ through $L$”, as $\sigma_{XYL} = \lambda_{x0} E[LY]$.

Lemma 1 Let $G(O)$ be a semilinear latent variable graph. For some set $O' = \{A, B, C, D\} \subseteq O$,
if $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ and for all triplets $\{X, Y, Z\}$, $\{X, Y\} \subset O'$, $Z \in O$,
we have $\rho_{XY.Z} \neq 0$ and $\rho_{XY} \neq 0$, then no element $X \in O'$ is an ancestor of any element
in $O' \setminus X$ in $G$ with probability 1 with respect to a Lebesgue measure over the coefficient and
error variance parameters.
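
A small numerical illustration of what the lemma rules out may help. It is a toy sketch, not part of the proof. The model, its coefficients and the function names below are hypothetical: four indicators share a single latent $L$, and an extra edge from $C$ to $D$ with coefficient `w` makes $C$ an ancestor of $D$. With `w = 0` all three tetrad products coincide. With `w != 0` the product $\sigma_{AB}\sigma_{CD}$ differs from the other two, so the premise of the lemma fails.

```python
def combine(*terms):
    """Linear combination of variables, each stored as {independent source: coefficient}."""
    out = {}
    for coef, variable in terms:
        for source, weight in variable.items():
            out[source] = out.get(source, 0.0) + coef * weight
    return out


def cov(x, y, variances):
    """Covariance of two variables written over independent zero-mean sources."""
    return sum(w * y.get(s, 0.0) * variances[s] for s, w in x.items())


def tetrad_differences(a, b, c, d, variances):
    """Pairwise differences among s_AB*s_CD, s_AC*s_BD and s_AD*s_BC."""
    t1 = cov(a, b, variances) * cov(c, d, variances)
    t2 = cov(a, c, variances) * cov(b, d, variances)
    t3 = cov(a, d, variances) * cov(b, c, variances)
    return (t1 - t2, t2 - t3, t1 - t3)


def toy_model(w):
    """One latent L with indicators A, B, C, D; w is the coefficient of edge C -> D."""
    variances = {"L": 1.0, "eA": 0.5, "eB": 0.6, "eC": 0.7, "eD": 0.8}
    L = {"L": 1.0}
    A = combine((0.9, L), (1.0, {"eA": 1.0}))
    B = combine((0.8, L), (1.0, {"eB": 1.0}))
    C = combine((0.7, L), (1.0, {"eC": 1.0}))
    D = combine((0.6, L), (w, C), (1.0, {"eD": 1.0}))
    return A, B, C, D, variances


print(tetrad_differences(*toy_model(0.0)))  # pure measurement model
print(tetrad_differences(*toy_model(0.5)))  # C is a parent of D
```

In this toy model the first difference works out to $a\,b\,w\,\mathrm{Var}(e_C)$, where $a$ and $b$ are the loadings of $A$ and $B$, so it vanishes only when the edge $C \to D$ is absent.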

Proof: Since G is acyclic among observed variables, at least one element in $O'$ is not
an ancestor in G of any other element in this set. By symmetry, we can assume without
loss of generality that D is such a node. Since the measurement model is linear, we can write
A, B, C, D as linear functions of their parents:

$$\begin{aligned}
A &= \sum_p a_p A_p \\
B &= \sum_i b_i B_i \\
C &= \sum_j c_j C_j \\
D &= \sum_k d_k D_k
\end{aligned}$$

where on the right-hand side of each equation we have the respective parents of A, B, C and
D. Such parents can be latents, other indicators or, for now, the respective error term,
but each indicator has at least one latent parent besides the error term. Let $\mathbf{L}$ be the set
of latent variables in G. Since each indicator is always a linear function of its parents, by
composition of linear functions we have that each $X \in O'$ will be a linear function of its
immediate latent ancestors, i.e., latent ancestors $L_{X_v}$ of $X$ such that there is a directed path
from $L_{X_v}$ to $X$ in G that does not contain any other element of $\mathbf{L}$. The equations above can
then be rewritten as:

$$\begin{aligned}
A &= \sum_p \lambda_{A_p} L_{A_p} \\
B &= \sum_i \lambda_{B_i} L_{B_i} \\
C &= \sum_j \lambda_{C_j} L_{C_j} \\
D &= \sum_k \lambda_{D_k} L_{D_k}
\end{aligned}$$

where on the right-hand side of each equation we have the respective immediate latent
ancestors of A, B, C and D, and the $\lambda$ parameters are functions of the original coefficients of the
measurement model. Notice that in general the sets of immediate latent ancestors for each
pair of elements in $O'$ will overlap.

Since the graph is acyclic, at least one element of {A, B, C} is not an ancestor of the
other two. By symmetry, assume without loss of generality that C is such a node. Assume
also C is an ancestor of D. We will prove by contradiction that this is not possible. Let L
be a latent parent of C, where the edge from L into C is labeled with c, corresponding to
its linear coefficient. We can rewrite the equation for C as

$$C = cL + \sum_j \lambda_{C_j} L_{C_j} \qquad (2)$$

where by an abuse of notation we are keeping the same symbols $\lambda_{C_j}$ and $L_{C_j}$ to represent
the other dependencies of C. Notice that it is possible that $L = L_{C_j}$ for some $L_{C_j}$ if there
is more than one directed path from L to C, but this will not be relevant for our proof. In
this case, the corresponding coefficient $\lambda_{C_j}$ is modified by subtracting c. It should be stressed
that the symbol c does not appear anywhere in the polynomial corresponding to $\sum_j \lambda_{C_j} L_{C_j}$,
where in this case the variables of the polynomial are the original coefficients parameterizing
the measurement model and the immediate latent ancestors of C.

By another abuse of notation, rewrite A, B and D as

$$\begin{aligned}
A &= c\omega_a L + \sum_p \lambda_{A_p} L_{A_p} \\
B &= c\omega_b L + \sum_i \lambda_{B_i} L_{B_i} \\
D &= c\omega_d L + \sum_k \lambda_{D_k} L_{D_k}
\end{aligned}$$

Each $\omega_v$ symbol is a polynomial function of all (possible) directed paths from C to
$X_v \in \{A, B, D\}$, as illustrated in Figure 5. The possible corresponding $\lambda_{X_v t}$ coefficient for L
is adjusted in the summation by subtracting $c\omega_v$ (again, L may appear in the summation
if there are directed paths from L to $X_v$ that do not go through C). If C has more than one
parent, then the expression for $\omega_v$ will appear again in some $\lambda_{X_v t}$. However, the symbol c
cannot appear again in any $\lambda_{X_v t}$, since $\omega_v$ summarizes all possible directed paths from C to
$X_v$. This remark will be very important later when we factorize the expression corresponding
to the tetrad constraints. Notice that, by assumption, $\omega_a = \omega_b = 0$ and $\omega_d \neq 0$. We keep
$\omega_a$ and $\omega_b$ in our equations to account for the next cases, where we will prove that B and
A cannot be ancestors of D. The reasoning will be analogous, but the respective $\omega$s will be
nonzero.

Figure 5: (a) The symbol $\omega_d$ is defined as the sum over all directed paths from C to D of the
product of the labels of each edge that appears in each path. Here the larger edges represent
edges in such directed paths. (b) An example: we have two directed paths from C to D.
The symbol $\omega_d$ then stands for $\alpha_1 + \alpha_2\alpha_3$, where each term in this polynomial corresponds
to one directed path. Notice that it is not possible to obtain any additive term that forms
$\omega_d$ out of the product of some $\lambda_{A_p}$, $\lambda_{B_i}$, $\lambda_{C_j}$, since D is not an ancestor of any of them: in
our example, $\alpha_1$ and $\alpha_2$ cannot appear in any $\lambda_{A_p}\lambda_{B_i}\lambda_{C_j}$ product ($\alpha_3$ may appear if X is an
ancestor of A or B).
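
The path-sum definition in the caption can be made concrete with a short sketch. The graph encoding and the function name are hypothetical. The assignment of labels to edges is also an assumption, since the figure itself is not reproduced here: $\alpha_1$ labels $C \to D$, $\alpha_2$ labels $X \to D$ and $\alpha_3$ labels $C \to X$.

```python
def path_sum(edges, source, target):
    """Sum, over all directed paths from source to target, of the product of
    the edge labels along each path. The graph must be acyclic.

    `edges` maps a node to a list of (child, label) pairs.
    """
    if source == target:
        return 1.0  # the empty path contributes an empty product
    return sum(label * path_sum(edges, child, target)
               for child, label in edges.get(source, []))


# Assumed layout of Figure 5(b): C -> D (alpha1), C -> X (alpha3), X -> D (alpha2).
alpha1, alpha2, alpha3 = 2.0, 3.0, 5.0
edges = {
    "C": [("D", alpha1), ("X", alpha3)],
    "X": [("D", alpha2)],
}

omega_d = path_sum(edges, "C", "D")
print(omega_d)  # alpha1 + alpha2 * alpha3
```

Since there is no directed path from D back to C, `path_sum(edges, "D", "C")` is zero. This reflects the fact that D is not an ancestor of C.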

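Another important point to be emphasized is that no term inside $\omega_d$ can appear in the
expression for A and B. That happens because D is not an ancestor of A, B or C, and at
least the edges from the parents of D to D cannot appear in any trek between any pair of
elements in {A, B, C} and every term inside $\omega_d$ contains the label of one edge between a
parent of D and D. This remark will also be very important later when we factorize the
expression corresponding to the tetrad constraints.

By the definitions above, we have:

$$\begin{aligned}
\sigma_{AB} &= c^2\omega_a\omega_b\sigma_L^2 + c\omega_a\sum_i \lambda_{B_i}\sigma_{L_{B_i}L} + c\omega_b\sum_p \lambda_{A_p}\sigma_{L_{A_p}L} + \sum_p\sum_i \lambda_{A_p}\lambda_{B_i}\sigma_{L_{A_p}L_{B_i}} \\
\sigma_{CD} &= c^2\omega_d\sigma_L^2 + c\sum_k \lambda_{D_k}\sigma_{L_{D_k}L} + c\omega_d\sum_j \lambda_{C_j}\sigma_{L_{C_j}L} + \sum_j\sum_k \lambda_{C_j}\lambda_{D_k}\sigma_{L_{C_j}L_{D_k}} \\
\sigma_{AC} &= c^2\omega_a\sigma_L^2 + c\omega_a\sum_j \lambda_{C_j}\sigma_{L_{C_j}L} + c\sum_p \lambda_{A_p}\sigma_{L_{A_p}L} + \sum_p\sum_j \lambda_{A_p}\lambda_{C_j}\sigma_{L_{A_p}L_{C_j}} \\
\sigma_{BD} &= c^2\omega_b\omega_d\sigma_L^2 + c\omega_b\sum_k \lambda_{D_k}\sigma_{L_{D_k}L} + c\omega_d\sum_i \lambda_{B_i}\sigma_{L_{B_i}L} + \sum_i\sum_k \lambda_{B_i}\lambda_{D_k}\sigma_{L_{B_i}L_{D_k}}
\end{aligned}$$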

Consider the polynomial identity $\sigma_{AB}\sigma_{CD} - \sigma_{AC}\sigma_{BD} = 0$ as a function of the parameters
of the measurement model, i.e., the linear coefficients and error variances for the observed
variables. Assume this constraint is entailed by G and its unknown latent covariance matrix.
With a Lebesgue measure over the parameters, this will hold with probability 1, which follows
from the fact that the solution set to non-trivial polynomial constraints has measure zero.
See Meek (1997) and references within for more details. This also means that every term
in this polynomial expression should vanish with probability 1: i.e., the coefficients
(functions of the latent covariance matrix) of every term in the polynomial should be zero.
Therefore, the sum of all terms with a factor $\omega_{dt} = l_1 l_2 \dots l_z$ at a given choice of exponents for
each $l_1, \dots, l_z$ should be zero, where $\omega_{dt}$ is some term inside the polynomial $\omega_d$.

Before using this result, we need to identify precisely which elements of the polynomial
$\sigma_{AB}\sigma_{CD} - \sigma_{AC}\sigma_{BD}$ can be factored by, say, $c^2\omega_{dt}$, for some $\omega_{dt}$. This can include elements
from any term that will explicitly show $c^2\omega_d$ when multiplying the covariance equations
above, among others, but we have to consider the multiplicity of the factors that compose
$\omega_{dt}$. Let $\omega_{dt} = l_1 l_2 \dots l_z$. We want to factorize our tetrad constraint according to terms
that contain $l_1 l_2 \dots l_z$ with multiplicity 1 for each label (i.e., our terms cannot include $l_1^2$,
for instance, or some subset of $\{l_1, \dots, l_z\}$). Since C does not have some descendant X
that is a common ancestor of A and D or B and D, this means that no algebraic term
$\omega_a$, $\omega_b$ or $\lambda_{A_p}$, $\lambda_{B_i}$ can contain some symbol in $\{l_1, \dots, l_z\}$. Notice that some $\lambda_{D_k}$s will be
functions of $\omega_{dt}$: every immediate latent ancestor of C is an immediate latent ancestor of D.
Therefore, for each common immediate latent ancestor $L_q$ of C and D, we have that
$\lambda_{D_q} = \omega_d\lambda_{C_q} + t(L_q, D) = \omega_{dt}\lambda_{C_q} + (\omega_d - \omega_{dt})\lambda_{C_q} + t(L_q, D)$, where $t(L_q, D)$ is a polynomial
representing other directed paths from $L_q$ to D that do not go through C.

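For example, consider the expression $c^2\omega_a\left(\sum_i \lambda_{B_i}\sigma_{L_{B_i}L}\right)\left(\sum_k \lambda_{D_k}\sigma_{L_{D_k}L}\right)$, which is an
additive term inside the product $\sigma_{AB}\sigma_{CD}$. If we group only those terms inside this expression
that contain $\omega_{dt}$, we will get $c^2\omega_a\omega_{dt}\left(\sum_i \lambda_{B_i}\sigma_{L_{B_i}L}\right)\left(\sum_j \lambda_{C_j}\sigma_{L_{C_j}L}\right)$, where the index j runs
over the same latent ancestors as in (2). As discussed before, no factor of $\omega_{dt}$ can be a factor
of any term in $\lambda_{B_i}$. The same holds for $\omega_a$. Therefore, the multiplicity of each $l_1, \dots, l_z$ in
this term is exactly 1.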

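When one writes down the algebraic expression for $\sigma_{AB}\sigma_{CD} - \sigma_{AC}\sigma_{BD}$ as a function of the $\lambda$s,
$c$, $\omega_a$, $\omega_b$ and $\omega_d$, the terms

$$\begin{aligned}
& c^2\omega_{dt}\Big[\sigma_L^2\sum_p\sum_i \lambda_{A_p}\lambda_{B_i}\sigma_{L_{A_p}L_{B_i}} + \omega_a\omega_b\sigma_L^2\sum_j\sum_{j'} \lambda_{C_j}\lambda_{C_{j'}}\sigma_{L_{C_j}L_{C_{j'}}} \\
&\qquad + \omega_a\sum_i \lambda_{B_i}\sigma_{L_{B_i}L}\sum_j \lambda_{C_j}\sigma_{L_{C_j}L} + \omega_b\sum_p \lambda_{A_p}\sigma_{L_{A_p}L}\sum_j \lambda_{C_j}\sigma_{L_{C_j}L}\Big] \\
& - c^2\omega_{dt}\Big[\omega_b\sigma_L^2\sum_p\sum_j \lambda_{A_p}\lambda_{C_j}\sigma_{L_{A_p}L_{C_j}} + \omega_a\sigma_L^2\sum_i\sum_j \lambda_{B_i}\lambda_{C_j}\sigma_{L_{B_i}L_{C_j}} \\
&\qquad + \omega_a\omega_b\sum_j \lambda_{C_j}\sigma_{L_{C_j}L}\sum_j \lambda_{C_j}\sigma_{L_{C_j}L} + \sum_p \lambda_{A_p}\sigma_{L_{A_p}L}\sum_i \lambda_{B_i}\sigma_{L_{B_i}L}\Big]
\end{aligned}$$

will be the only ones that can be factorized by $c^2\omega_{dt}$, where the power of c in such terms is
2, and the multiplicity of each $l_1, \dots, l_z$ is 1. Since this has to be identically zero and $\omega_{dt} \neq 0$,
we have the following relation:

$$f_1(G) = f_2(G) \qquad (3)$$

where

$$\begin{aligned}
f_1(G) = c^2\Big[&\sigma_L^2\sum_p\sum_i \lambda_{A_p}\lambda_{B_i}\sigma_{L_{A_p}L_{B_i}} + \omega_a\omega_b\sigma_L^2\sum_j\sum_{j'} \lambda_{C_j}\lambda_{C_{j'}}\sigma_{L_{C_j}L_{C_{j'}}} \\
&+ \omega_a\sum_i \lambda_{B_i}\sigma_{L_{B_i}L}\sum_j \lambda_{C_j}\sigma_{L_{C_j}L} + \omega_b\sum_p \lambda_{A_p}\sigma_{L_{A_p}L}\sum_j \lambda_{C_j}\sigma_{L_{C_j}L}\Big] \\
f_2(G) = c^2\Big[&\omega_b\sigma_L^2\sum_p\sum_j \lambda_{A_p}\lambda_{C_j}\sigma_{L_{A_p}L_{C_j}} + \omega_a\sigma_L^2\sum_i\sum_j \lambda_{B_i}\lambda_{C_j}\sigma_{L_{B_i}L_{C_j}} \\
&+ \omega_a\omega_b\sum_j \lambda_{C_j}\sigma_{L_{C_j}L}\sum_j \lambda_{C_j}\sigma_{L_{C_j}L} + \sum_p \lambda_{A_p}\sigma_{L_{A_p}L}\sum_i \lambda_{B_i}\sigma_{L_{B_i}L}\Big]
\end{aligned}$$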

Similarly,  when  we  factorize  terms  that  include  coy#,  where  the  respective  powers  of 
$c, l_1, \ldots, l_z$ in the term have to be 1, we get the following expression as an additive term of

®AB®CD  —  ®AC®BD'- 

CUdt[uJa  X  XBi®Lg.L  X  X  XC,  ACV,  ®  Lc.  LCj,  +  Ub  X  XAP  ®LApL  X  X  XC ,XC  ,® l.r  !.<■.,  + 

2  5]  XCj®LCjL  Y  Y  XApXBi®  LApLB.}- 

CUJdt[ua  Y  XCj®LCjL  Y  Y  XH;X( 'd1/ Lc.  +  Y  XAP®  LApL  Y  Y  XBi^Cj®LB.LCj  + 

Y  XCj®LCjL  Y  Y  XApXCj®LApLCj  +  Y  XB{®  LB.L  Y  Y  XApXCj°LApLCj] 

for  which  we  have: 

$g_1(G) = g_2(G)$  (4)

where 

9i(G)  =  c[  Ua  Y  XBi®LB.L  Y  Y  XC  -j\  >a/ r,  LCj,  +  UbY  XAp®LApL  Y  Y  LCjLc + 

2  XI  XCj®LCjL  Y  Y  XApXBi®  LApLB.} 

92(G)  =  c[uja  Y  ^CjVLcjL  Y  Y  ^Bi^CjVLg.Lc.  +  X  XAp®LApL  X  X  XBi^Cj® Lg.Lc.  + 

^b  X  XCj®LCjL  X  X  XApXCj®LApLCj  +  X  XBi®LB.L  X  X  XAV XCj ®  LAp LCj] 


Finally,  we  look  at  terms  multiplying  c odt  without  c,  which  will  result  in: 

$h_1(G) = h_2(G)$  (5)

where 

MG)  =  YY1  XAP^B^LApLBi  Y  Y  X(  'iXr/T,  Cj  i-r., 
h2(G)  =  Y  Y  XAPXCj®LApLCj  Y  XBiXCjV Lg.Lc j 

Writing down the full expression for $\sigma_{AC}\sigma_{BC}$ and $\sigma^2_C\sigma_{AB}$ will result in:

$\sigma_{AC}\sigma_{BC} = P(G) + f_2(G) + g_2(G) + h_2(G)$  (6)

$\sigma^2_C\sigma_{AB} = P(G) + f_1(G) + g_1(G) + h_1(G)$  (7)

where 

P(G )  =  c4o;acc;fe(u|)2  +  c3(uacafeaf  YxCj°lc  l  +  c3ujaG2LYXBi°LB.L+ 

C3UJaUJbal  X  XCj®LCjL  +  C2cua  X  XCj® LCjL  X 

cW2  X  XAP ®LApL  +  C2Ub  X  XCj®LCjL  X  XAp®LApL 

By  (3),  (4),  (5),  (6)  and  (7),  we  have: 

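$\sigma_{AC}\sigma_{BC} = \sigma^2_C\sigma_{AB} \Rightarrow \sigma_{AB} - \sigma_{AC}\sigma_{BC}/\sigma^2_C = 0 \Rightarrow \rho_{AB.C} = 0$

Contradiction. Therefore, C cannot be an ancestor of D, and more generally, of any element in $O' \backslash C$.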
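The last implication can be checked numerically. The following is only an illustrative sketch: the function name `partial_corr` and all covariance values are hypothetical, and $\sigma^2_C$ is read as the variance of C. It chooses $\sigma_{AB}$ so that $\sigma_{AC}\sigma_{BC} = \sigma^2_C\sigma_{AB}$ holds and confirms that the partial correlation of A and B given C then vanishes.

```python
import math

def partial_corr(s_ab, s_ac, s_bc, s_aa, s_bb, s_cc):
    """Partial correlation of A and B given C, computed from (co)variances."""
    num = s_ab - s_ac * s_bc / s_cc
    den = math.sqrt((s_aa - s_ac ** 2 / s_cc) * (s_bb - s_bc ** 2 / s_cc))
    return num / den

# Hypothetical variances and covariances, chosen only for illustration.
s_aa, s_bb, s_cc = 2.0, 3.0, 1.5
s_ac, s_bc = 0.9, 0.6

# Force sigma_AC * sigma_BC = sigma_C^2 * sigma_AB.
s_ab = s_ac * s_bc / s_cc

rho = partial_corr(s_ab, s_ac, s_bc, s_aa, s_bb, s_cc)
print(rho)
```

Any other positive variances would do, as long as the two factors under the square root stay positive.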
Assume without loss of generality that B is not an ancestor of A. C is not an ancestor of any element in $O' \backslash C$. If B does not have a descendant that is a common ancestor of C and D, then by analogy with the (C, D) case (where now more than one $\omega$ element will be nonzero as hinted before, since we have to consider the possibility of B being an ancestor of both C and D), B cannot be an ancestor of C nor D.

Assume then that B has a descendant X that is a common ancestor of C and D, where $X \neq C$ and $X \neq D$, since C is not an ancestor of D and vice-versa. Notice also that X is not an ancestor of A, since B is not an ancestor of A. Relations such as Equation 3 might not hold, since we might be equating terms that have different exponents for symbols in $\{l_1, \ldots, l_z\}$. However, since now we have an observed intermediate term X, we can make use of its error variance parameter $\zeta_X$ corresponding to the error term $\epsilon_X$.

No term in $\sigma_{AB}$ can have $\zeta_X$, since $\epsilon_X$ is independent of both A and B. There is at least one term in $\sigma_{CD}$ that contains $\zeta_X$ as a factor. There is no term in $\sigma_{AC}$ that contains $\zeta_X$ as a factor, since $\epsilon_X$ is independent of A. There is no term in $\sigma_{BD}$ that contains $\zeta_X$ as a factor, since $\epsilon_X$ is independent of B. Therefore, in $\sigma_{AB}\sigma_{CD}$ we have at least one term that has $\zeta_X$, while no term in $\sigma_{AC}\sigma_{BD}$ contains such a factor. That requires some parameters or the variance of some latent ancestor of B to be zero, which is a contradiction.

Therefore, B is not an ancestor of any element in $O' \backslash B$. In a completely analogous way, one can show that A is not an ancestor of any element in $O' \backslash A$. □

Lemma 2 Let G(O) be a semilinear latent variable model. Let $\{A, B, C, D\} \subset O$ such that A is not an ancestor of B, C or D in G and A has a parent L in G, and no element of the covariance matrix of A, B, C and D is zero. If $\sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$, then $\sigma_{ACL} = \sigma_{ADL} = 0$ or $\sigma_{ACL}/\sigma_{ADL} = \sigma_{AC}/\sigma_{AD} = \sigma_{BC}/\sigma_{BD}$ with probability 1 with respect to a Lebesgue measure over the coefficient parameters.

Proof: Since G is a linear latent variable graph, we can express A, B, C and D as linear functions of their parents as follows:

$A = aL + \sum_p a_p A_p$

$B = \sum_i b_i B_i$

$C = \sum_j c_j C_j$

$D = \sum_k d_k D_k$

where on the right-hand side of each equation the uppercase symbols denote the respective parents of each variable on the left side, error terms included.

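Given the assumptions, we have:

$\sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$

$E[a \sum_j c_j L C_j + \sum_p \sum_j a_p c_j A_p C_j]\sigma_{BD} = E[a \sum_k d_k L D_k + \sum_p \sum_k a_p d_k A_p D_k]\sigma_{BC}$

$a(\sum_j c_j \sigma_{LC_j})\sigma_{BD} + \sum_p \sum_j a_p c_j \sigma_{A_pC_j}\sigma_{BD} = a(\sum_k d_k \sigma_{LD_k})\sigma_{BC} + \sum_p \sum_k a_p d_k \sigma_{A_pD_k}\sigma_{BC}$

$a[(\sum_j c_j \sigma_{LC_j})\sigma_{BD} - (\sum_k d_k \sigma_{LD_k})\sigma_{BC}] + [\sum_p \sum_j a_p c_j \sigma_{A_pC_j}\sigma_{BD} - \sum_p \sum_k a_p d_k \sigma_{A_pD_k}\sigma_{BC}] = 0$

Since A is not an ancestor of B, C or D, there is no trek among elements of {B, C, D} containing both L and A, and therefore the symbol a cannot appear in $\sum_p \sum_j a_p c_j \sigma_{A_pC_j}\sigma_{BD} - \sum_p \sum_k a_p d_k \sigma_{A_pD_k}\sigma_{BC}$ when we expand each covariance as a function of the parameters of G. Therefore, since this polynomial is identically zero, we have to have the coefficient for a equal to zero, which implies:

$a(\sum_j c_j \sigma_{LC_j})\sigma_{BD} = a(\sum_k d_k \sigma_{LD_k})\sigma_{BC} \Rightarrow \sigma_{ACL}\sigma_{BD} = \sigma_{ADL}\sigma_{BC}$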

Since no element in $\Sigma_{ABCD}$ is zero, then $\sigma_{ACL} = 0 \Leftrightarrow \sigma_{ADL} = 0$. If $\sigma_{ACL} \neq 0$, then $\sigma_{ACL}/\sigma_{ADL} = \sigma_{AC}/\sigma_{AD} = \sigma_{BC}/\sigma_{BD}$. □

Lemma 3 Let G(O) be a semilinear latent variable graph. Assume $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$ and $\sigma_{X_1Y_1}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{X_3Y_1} = \sigma_{X_1X_3}\sigma_{X_2Y_1}$, $\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$, $\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$ and that for all triplets {A, B, C}, $\{A, B\} \subset \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, $C \in O$, we have $\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$. Then $X_1$ and $Y_1$ do not have a common parent in G with probability 1 with respect to a Lebesgue measure over the coefficient and error variance parameters.


Proof: Suppose $X_1$ and $Y_1$ have a common parent L in G. Let $X_1 = aL + \sum_p a_p A_p$ and $Y_1 = bL + \sum_i b_i B_i$, where each $A_p$, $B_i$ are parents in G of $X_1$ and $Y_1$, respectively.

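By Lemma 1 and the given constraints, an element of $\{X_1, Y_1\}$ cannot be an ancestor of the other, and neither can be an ancestor in G of any element in $\{X_2, X_3, Y_2, Y_3\}$. By definition, $\sigma_{X_1VL} = (a/b)\sigma_{Y_1VL}$ for some variable V, and therefore $\sigma_{X_1VL} = 0 \Leftrightarrow \sigma_{Y_1VL} = 0$. Assume $\sigma_{Y_1X_2L} = \sigma_{X_1X_2L} = 0$. Since it is given that $\sigma_{X_1Y_1}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_1X_3}$, by Lemma 2 we have $\sigma_{X_1Y_1L} = \sigma_{X_1X_2L} = 0$. Since $\sigma_{X_1Y_1L} = ab\sigma^2_L + K$, where no term in K contains the factor ab, then if $\sigma_{X_1Y_1L} = 0$, with probability 1 $ab\sigma^2_L = 0 \Rightarrow \sigma^2_L = 0$, which is a contradiction of the assumptions. By repeating the argument, no element in $\{\sigma_{X_1X_2L}, \sigma_{X_1X_3L}, \sigma_{Y_1X_2L}, \sigma_{Y_1X_3L}, \sigma_{X_1Y_2L}, \sigma_{X_1Y_3L}, \sigma_{Y_1Y_2L}, \sigma_{Y_1Y_3L}\}$ is zero. Therefore, since $\sigma_{X_1Y_1}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{X_3Y_1} = \sigma_{X_1X_3}\sigma_{X_2Y_1}$ by assumption, from Lemma 2 we have

$\frac{\sigma_{X_1X_3}}{\sigma_{X_3Y_1}} = \frac{\sigma_{X_1X_3L}}{\sigma_{X_3Y_1L}}$  (8)

and from $\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$

$\frac{\sigma_{Y_1Y_3}}{\sigma_{X_1Y_3}} = \frac{\sigma_{Y_1Y_3L}}{\sigma_{X_1Y_3L}}$  (9)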

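Since no covariance among the given variables is zero, the tetrad constraints $\sigma_{X_1X_2}\sigma_{Y_1X_3} = \sigma_{X_1X_3}\sigma_{Y_1X_2}$ and $\sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$ give

$\frac{\sigma_{X_1X_2}\sigma_{Y_1Y_2}}{\sigma_{X_1Y_2}\sigma_{Y_1X_2}} = \frac{\sigma_{X_1X_3}\sigma_{Y_1Y_3}}{\sigma_{Y_1X_3}\sigma_{X_1Y_3}}$

From (8), (9) it follows:

$\frac{\sigma_{X_1X_2}\sigma_{Y_1Y_2}}{\sigma_{X_1Y_2}\sigma_{Y_1X_2}} = \frac{\sigma_{X_1X_3L}\sigma_{Y_1Y_3L}}{\sigma_{Y_1X_3L}\sigma_{X_1Y_3L}} = \frac{(a/b)\sigma_{Y_1X_3L}(b/a)\sigma_{X_1Y_3L}}{\sigma_{Y_1X_3L}\sigma_{X_1Y_3L}} = 1 \Rightarrow \sigma_{X_1X_2}\sigma_{Y_1Y_2} = \sigma_{X_1Y_2}\sigma_{Y_1X_2}$

Contradiction. □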


The  following  lemma  will  be  useful  to  prove  the  correctness  of  CS2: 

Lemma 4 Let G(O) be a linear latent variable model, and let $\{X_1, X_2, X_3, X_4\} \subset O$ be such that $\sigma_{X_1X_2}\sigma_{X_3X_4} = \sigma_{X_1X_3}\sigma_{X_2X_4} = \sigma_{X_1X_4}\sigma_{X_2X_3}$. If $\rho_{AB} \neq 0$ for all $\{A, B\} \subset \{X_1, X_2, X_3, X_4\}$, then a unique choke point P entails all the given tetrad constraints, and P d-separates all elements in $\{X_1, X_2, X_3, X_4\}$.

Proof: Let P be a choke point for pairs $\{X_1, X_2\} \times \{X_3, X_4\}$. Let Q be a choke point for pairs $\{X_1, X_3\} \times \{X_2, X_4\}$. We will show that P = Q by contradiction.

Assume $P \neq Q$. Because there is a trek that links $X_1$ and $X_4$ through P (since $\rho_{X_1X_4} \neq 0$), we have that Q should also be on that trek. Suppose T is a trek connecting $X_1$ to $X_4$ through P and Q, and without loss of generality assume this trek follows an order that defines three subtreks: $T_0$, from $X_1$ to P; $T_1$, from P to Q; and $T_2$, from Q to $X_4$, as illustrated by Figure 6(a). In principle, $T_0$ and $T_2$ might be empty, i.e., we are not excluding the possibility that $X_1 = P$ or $X_4 = Q$.

There must be at least one trek $T_{Q2}$ connecting $X_2$ and Q, since Q is on every trek between $X_1$ and $X_2$ and there is at least one such trek (since $\rho_{X_1X_2} \neq 0$). We have the following cases:

Case 1: $T_{Q2}$ includes P. $T_{Q2}$ has to be into P, and $P \neq X_1$, or otherwise there will be a trek connecting $X_2$ to $X_1$ through a (possibly empty) trek $T_0$ that does not include Q, contrary to our hypothesis. For the same reason, $T_0$ has to be into P. This will imply that $T_1$ is a directed path from P to Q, and $T_2$ is a directed path from Q to $X_4$ (Figure 6(b)).

Because there is at least one trek connecting $X_1$ and $X_2$ (since $\rho_{X_1X_2} \neq 0$), and because Q is on every such trek, Q has to be an ancestor of at least one member of $\{X_1, X_2\}$. Without loss of generality, assume Q is an ancestor of $X_1$. No directed path from Q to $X_1$ can include P, since P is an ancestor of Q and the graph is acyclic. Therefore, there is a trek connecting $X_1$ and $X_4$ with Q as the source that does not include P, contrary to our hypothesis.

Case 2: $T_{Q2}$ does not include P. This case is similar to Case 1. $T_{Q2}$ has to be into Q, and $Q \neq X_4$, or otherwise there will be a trek connecting $X_2$ to $X_4$ through a (possibly empty) trek $T_2$ that does not include P, contrary to our hypothesis. For the same reason, $T_2$ has to be into Q. This will imply that $T_1$ is a directed path from Q to P, and $T_0$ is a directed path from P to $X_1$. An argument analogous to Case 1 will follow.

We will now show by contradiction that P d-separates all nodes in $\{X_1, X_2, X_3, X_4\}$. From the P = Q result, we know that P lies on every trek between any pair of elements in $\{X_1, X_2, X_3, X_4\}$. First consider the case where at most one element of $\{X_1, X_2, X_3, X_4\}$ is linked to P through a trek that is into P. By the Tetrad Representation Theorem, any trek connecting two elements of $\{X_1, X_2, X_3, X_4\}$ goes through P. Since P cannot be a collider on any trek, then P d-separates these two elements.

Otherwise, at least two elements are linked to P through treks that are into P. Without loss of generality, assume there is a trek connecting $X_1$ and P that is into P, and a trek connecting $X_2$ and P that is into P. If there is no trek connecting $X_1$ and P that is out of P, nor any trek connecting $X_2$ and P that is out of P, then there is no trek connecting $X_1$ and $X_2$, since P is on every trek connecting these two elements according to the Tetrad Representation Theorem. But this implies $\rho_{X_1X_2} = 0$, a contradiction, as illustrated by Figure 6(c).

Figure 6: In (a), a depiction of a trek T linking $X_1$ and $X_4$ through P and Q, creating three subtreks labeled as $T_0$, $T_1$ and $T_2$. Directions in such treks are left unspecified. In (b), the existence of a trek $T_{Q2}$ linking $X_2$ and Q through P will compel the directions depicted as a consequence of the given tetrad and correlation constraints (the dotted path represents any possible continuation of $T_{Q2}$ that does not coincide with T). The configuration in (c) cannot happen if P is a choke point entailing all three tetrads among marginally dependent nodes $\{X_1, X_2, X_3, X_4\}$. The configuration in (d) cannot happen if P is a choke point for $\{X_1, X_3\} \times \{X_2, X_4\}$, since there is a trek $X_1 - P - X_2$ such that P is not on the $\{X_1, X_3\}$ side of it, and another trek $X_2 - S - P - X_3$ such that P is not on the $\{X_2, X_4\}$ side of it.

Consider the case where there is also a trek out of P and into $X_2$. Then there is a trek connecting $X_1$ to $X_2$ through P such that P is not on the $\{X_1, X_3\}$ side of it, with respect to the pair $\{X_1, X_3\} \times \{X_2, X_4\}$ to which P is a choke point. Therefore, P should be on the $\{X_2, X_4\}$ side of every trek connecting element pairs in $\{X_1, X_3\} \times \{X_2, X_4\}$. Without loss of generality, assume there is a trek out of P and into $X_3$ (because if there is no such trek for either $X_3$ or $X_1$, we fall in the previous case by symmetry). Let S be the source of a trek into P and $X_2$, which should exist since $X_2$ is not an ancestor of P. Then there is a trek of source S connecting $X_3$ and $X_2$ such that P is not on the $\{X_2, X_4\}$ side of it, as shown in Figure 6(d). Therefore P cannot be a choke point for $\{X_1, X_3\} \times \{X_2, X_4\}$. Contradiction. □
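The configuration that Lemma 4 concludes can be illustrated with a small numerical sketch. It is not part of the proof and covers only the easy direction: when a single node P is the only common cause of four variables, all three tetrad constraints hold and conditioning on P removes every pairwise covariance. The loadings in `lam` and the variance `phi` are hypothetical values.

```python
from itertools import combinations

# Hypothetical one-factor model: X_i = lam_i * P + independent error.
phi = 1.3                                            # variance of P
lam = {"X1": 0.8, "X2": -1.1, "X3": 0.5, "X4": 1.7}  # coefficients of P

def cov(a, b):
    """Covariance of two distinct indicators implied by the model."""
    return lam[a] * lam[b] * phi

# The three products compared by the tetrad constraints.
t1 = cov("X1", "X2") * cov("X3", "X4")
t2 = cov("X1", "X3") * cov("X2", "X4")
t3 = cov("X1", "X4") * cov("X2", "X3")
tetrads_hold = abs(t1 - t2) < 1e-12 and abs(t1 - t3) < 1e-12

def partial_cov_given_p(a, b):
    """Covariance of a and b after conditioning on P (linear case)."""
    cov_ap = lam[a] * phi
    cov_bp = lam[b] * phi
    return cov(a, b) - cov_ap * cov_bp / phi

# P d-separates every pair, so every partial covariance should vanish.
separated = all(abs(partial_cov_given_p(a, b)) < 1e-12
                for a, b in combinations(lam, 2))

print(tetrads_hold, separated)
```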


Lemma 5 Let G(O) be a linear latent variable model. Assume $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$, $X_1$ is not an ancestor of $X_3$, $Y_1$ is not an ancestor of $Y_3$, $F_1(X_1, X_2, G)$ = true, $F_1(Y_1, Y_2, G)$ = true and $\sigma_{X_1Y_1}\sigma_{X_2Y_2} = \sigma_{X_1Y_2}\sigma_{X_2Y_1}$, $\sigma_{X_2Y_1}\sigma_{Y_2Y_3} = \sigma_{X_2Y_3}\sigma_{Y_2Y_1}$, $\sigma_{X_1X_2}\sigma_{X_3Y_2} = \sigma_{X_1Y_2}\sigma_{X_3X_2}$, $\sigma_{X_1X_2}\sigma_{Y_1Y_2} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_1}$ and that for all triplets {A, B, C}, $\{A, B\} \subset \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, $C \in O$, we have $\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$. Then $X_1$ and $Y_1$ do not have a common parent in G.

Figure 7: Figure (a) illustrates necessary treks among elements of $\{X_1, X_2, Y_1, Y_2, L\}$ according to the assumptions of Lemma 5 if we further assume that $X_1$ is a choke point for pairs $\{X_1, X_2\} \times \{Y_1, Y_2\}$ (other treks might exist). Figure (b) rearranges (a) by emphasizing that $Y_1$ and $Y_2$ cannot be d-separated by a single node.

Proof: We will prove this result by contradiction. Assume $X_1$ and $Y_1$ have a common parent L. Because of the tetrad constraints given by hypothesis and the existence of the trek $X_1 \leftarrow L \rightarrow Y_1$, one node in $\{X_1, L, Y_1\}$ should be a choke point for the pair $\{X_1, X_2\} \times \{Y_1, Y_2\}$. We will first show that L has to be such a choke point, and therefore lies on every trek connecting $X_1$ and $Y_2$, as well as $X_2$ and $Y_1$. We then show that L lies on every trek connecting $Y_1$ and $Y_2$, as well as $X_1$ and $X_2$. Finally, we show that L is a choke point for $\{X_1, Y_1\} \times \{X_2, Y_2\}$, contrary to our hypothesis.

Step 1: If there is a common parent L to $X_1$ and $Y_1$, then L is a $\{X_1, X_2\} \times \{Y_1, Y_2\}$ choke point. For the sake of contradiction, assume $X_1$ is a choke point in this case. By Lemma 1 and assumption $F_1(X_1, X_2, G)$, we have that $X_1$ is not an ancestor of $X_2$, and therefore all treks connecting $X_1$ and $X_2$ should be into $X_1$. Since $\rho_{X_2Y_2} \neq 0$ by assumption and $X_1$ is on all treks connecting $X_2$ and $Y_2$, there must be a directed path out of $X_1$ and into $Y_2$. Since $\rho_{X_2Y_2.X_1} \neq 0$ by assumption and $X_1$ is on all treks connecting $X_2$ and $Y_2$, there must be a trek into $X_1$ and $Y_2$. Because $\rho_{X_2Y_1} \neq 0$, there must be a trek out of $X_1$ and into $Y_1$. Figure 7(a) illustrates the configuration.

Since $F_1(Y_1, Y_2, G)$ is true, by Lemma 4 there must be a node d-separating $Y_1$ and $Y_2$ (neither $Y_1$ nor $Y_2$ can be the choke point in $F_1(Y_1, Y_2, G)$ because this choke point has to be latent, according to the partial correlation conditions of $F_1$). However, by Figure 7(b), treks $T_2 - T_3$ and $T_1 - T_4$ cannot both be blocked by a single node. Contradiction. Therefore $X_1$ cannot be a choke point for $\{X_1, X_2\} \times \{Y_1, Y_2\}$ and, by symmetry, neither can $Y_1$.

Figure 8: In (a), a depiction of $T_Y$ and $T_X$, where edges represent treks ($T_X$ can be seen more generally as the combination of the solid edge between $X_2$ and P concatenated with a dashed edge between P and $Y_1$ representing the possibility that $T_Y$ and $T_X$ might intersect multiple times in $T_{PY}$, but in principle do not need to coincide in $T_{PY}$ if P is not a choke point). In (b), a possible configuration of edges $\langle X_{-1}, P \rangle$ and $\langle P, Y_{+1} \rangle$ that do not collide in P, and P is a choke point (and $Y_{+1} \neq Y_1$). In (c), the edge $\langle Y_{-1}, P \rangle$ is compelled to be directed away from P because of the collider with the other two neighbors of P.

Step 2: L is on every trek connecting $Y_1$ and $Y_2$ and on every trek connecting $X_1$ and $X_2$. Let L be the choke point for pairs $\{X_1, X_2\} \times \{Y_1, Y_2\}$. As a consequence, all treks between $Y_2$ and $X_1$ go through L. All treks between $X_2$ and $Y_1$ go through L. All treks between $X_2$ and $Y_2$ go through L. Such treks exist, since no respective correlation vanishes.

Consider the given hypothesis $\sigma_{X_2Y_1}\sigma_{Y_2Y_3} = \sigma_{X_2Y_3}\sigma_{Y_2Y_1}$, corresponding to a choke point $\{X_2, Y_2\} \times \{Y_1, Y_3\}$. From the previous paragraph, we know there is a trek linking $Y_2$ and L. L is a parent of $Y_1$ by construction. That means $Y_2$ and $Y_1$ are connected by a trek through L.

We will show by contradiction that L is on every trek connecting $Y_1$ and $Y_2$. Assume there is a trek $T_Y$ connecting $Y_2$ and $Y_1$ that does not contain L. Let P be the first point of intersection of $T_Y$ and a trek $T_X$ connecting $X_2$ to $Y_1$, starting from $X_2$. If $T_Y$ exists, such point should exist, since $T_Y$ should contain a choke point $\{X_2, Y_2\} \times \{Y_1, Y_3\}$, and all treks connecting $X_2$ and $Y_1$ (including $T_X$) contain the same choke point.

Let $T_{PY}$ be the subtrek of $T_Y$ starting on P and ending one node before $Y_1$. Any choke point $\{X_2, Y_2\} \times \{Y_1, Y_3\}$ should lie on $T_{PY}$ (Figure 8(a)). ($Y_1$ cannot be such a choke point, since all treks connecting $Y_1$ and $Y_2$ are into $Y_1$, and by hypothesis all treks connecting $Y_1$ and $Y_3$ are into $Y_1$. Since all treks connecting $Y_2$ and $Y_3$ would need to go through $Y_1$ by definition, then there would be no such trek, implying $\rho_{Y_2Y_3} = 0$, contrary to our hypothesis.)

Assume first that $X_2 \neq P$ and $Y_2 \neq P$. Let $X_{-1}$ be the node before P in $T_X$ starting from $X_2$. Let $Y_{-1}$ be the node before P in $T_Y$ starting from $Y_2$. Let $Y_{+1}$ be the node after P in $T_Y$ starting from $Y_2$ (notice that it is possible that $Y_{+1} = Y_1$). If $X_{-1}$ and $Y_{+1}$ do not collide on P (i.e., there is no structure $X_{-1} \rightarrow P \leftarrow Y_{+1}$), then there will be a trek connecting $X_2$ to $Y_1$ through $T_{PY}$ after P. Since L is not in $T_{PY}$, L should be before P in $T_X$. But then there will be a trek connecting $X_2$ and $Y_1$ that does not intersect $T_{PY}$, which is a contradiction (Figure 8(b)). If the collider does exist, we have the edge $P \leftarrow Y_{+1}$. Since no collider $Y_{-1} \rightarrow P \leftarrow Y_{+1}$ can exist because $T_Y$ is a trek, the edge between $Y_{-1}$ and P is out of P. But that forms a trek connecting $X_2$ and $Y_2$ (Figure 8(c)), and since L is in every trek between $X_2$ and $Y_2$ and $T_Y$ does not contain L, then $T_X$ should contain L before P, which again creates a trek between $X_2$ and $Y_1$ that does not intersect $T_{PY}$.

Figure 9: In (a), $Y_2$ and $X_1$ cannot share a parent, and because of the given tetrad constraints, L should d-separate M and $Y_3$. $Y_3$ is not a child of L either, but there will be a trek linking L and (not necessarily into) $Y_3$. In (b), a set of possible configurations for $X_2$ and $X_3$, where $X_3$ has some parent in the trek linking M and L. In (c), another variation where now $X_2$ and $X_3$ share a parent in that trek.

If $X_2 = P$, then $T_{PY}$ has to contain L, because every trek between $X_2$ and $Y_1$ contains L. Therefore, $X_2 \neq P$. If $Y_2 = P$, then because every trek between $X_2$ and $Y_2$ should contain L, we again have that L lies in $T_X$ before P, which creates a trek between $X_2$ and $Y_1$ that does not intersect $T_{PY}$. Therefore, we showed by contradiction that L lies on every trek between $Y_2$ and $Y_1$.

Consider now the given hypothesis $\sigma_{X_1X_2}\sigma_{X_3Y_2} = \sigma_{X_1Y_2}\sigma_{X_3X_2}$, corresponding to a choke point $\{X_2, Y_2\} \times \{X_1, X_3\}$. By symmetry with the previous case, all treks between $X_1$ and $X_2$ go through L.

Step 3: If L exists, so does a choke point $\{X_1, Y_1\} \times \{X_2, Y_2\}$. By the previous steps, L intermediates all treks between elements of the pair $\{X_1, Y_1\} \times \{X_2, Y_2\}$. Because L is a common parent of $\{X_1, Y_1\}$, it lies on the $\{X_1, Y_1\}$ side of every trek connecting pairs of elements in $\{X_1, Y_1\} \times \{X_2, Y_2\}$. L is a choke point for this pair. This implies $\sigma_{X_1X_2}\sigma_{Y_1Y_2} = \sigma_{X_1Y_2}\sigma_{X_2Y_1}$. Contradiction. □

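Lemma 6 Let G(O) be a linear latent variable graph. Assume $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\} \subset O$ and $\sigma_{X_1Y_1}\sigma_{Y_2Y_3} = \sigma_{X_1Y_2}\sigma_{Y_1Y_3} = \sigma_{X_1Y_3}\sigma_{Y_1Y_2}$, $\sigma_{X_1Y_2}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_2X_3} = \sigma_{X_1X_3}\sigma_{X_2Y_2}$, $\sigma_{X_1Y_3}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_3X_3} = \sigma_{X_1X_3}\sigma_{X_2Y_3}$, $\sigma_{X_1X_2}\sigma_{Y_2Y_3} \neq \sigma_{X_1Y_2}\sigma_{X_2Y_3}$ and that for all triplets {A, B, C}, $\{A, B\} \subset \{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, $C \in O$, we have $\rho_{AB} \neq 0$, $\rho_{AB.C} \neq 0$. Then $X_1$ and $Y_1$ do not have a common parent in G.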

Proof: We will prove this result by contradiction. Suppose $X_1$ and $Y_1$ have a common parent L in G. Since all three tetrads hold in the covariance matrix of $\{X_1, Y_1, Y_2, Y_3\}$, by Lemma 4 the choke point that entails these constraints d-separates the elements of $\{X_1, Y_1, Y_2, Y_3\}$. The choke point should be in the trek $X_1 \leftarrow L \rightarrow Y_1$, and since it cannot be an observed node because by hypothesis no d-separation conditioned on a single node holds among elements of $\{X_1, Y_1, Y_2, Y_3\}$, L has to be a latent choke point for all pairs of pairs in $\{X_1, Y_1, Y_2, Y_3\}$.

Figure 10: In (a), $Y_2$ and $X_1$ cannot share a parent, and because of the given tetrad constraints, L should d-separate M and $Y_3$. $Y_3$ is not a child of L either, but there will be a trek linking L and $Y_3$. In (b), an (invalid) configuration for $X_2$ and $X_3$, where they share an ancestor between M and L.

Given the constraints in the hypothesis, it is the case that, by Lemma 3, $X_1$ and $Y_2$ cannot share a parent. Let $T_{ML}$ be a trek connecting some parent M of $Y_2$ and L. Such a trek exists because $\rho_{X_1Y_2} \neq 0$.

We will show by contradiction that there is no node in $T_{ML} \backslash L$ that is connected to $Y_3$ by a trek that does not go through L. Suppose there is such a node, and call it V. If the trek connecting V and $Y_3$ is into V, and since V is not a collider in $T_{ML}$, then V is either an ancestor of M or an ancestor of L. If V is an ancestor of M, then there will be a trek connecting $Y_2$ and $Y_3$ that is not through L, which is a contradiction. If V is an ancestor of L but not M, then both $Y_2$ and $Y_3$ are d-connected to V, which is a collider at the intersection of such d-connecting treks. However, V is an ancestor of L, which means L cannot d-separate $Y_2$ and $Y_3$, a contradiction. Finally, if the trek connecting V and $Y_3$ is out of V, then $Y_2$ and $Y_3$ will be connected by a trek that does not include L, which again is not allowed. We therefore showed there is no node with the properties of V. This configuration is illustrated by Figure 10(a).

Since all three tetrads hold among elements of $\{X_1, X_2, X_3, Y_2\}$, then by Lemma 4, there is a single choke point P that entails such tetrads and d-separates elements of this set. Since $T_{ML}$ is a trek connecting $Y_2$ to $X_1$ through L, then there are three possible locations for P in G:

Case 1: P = M. We have all treks between $X_3$ and $X_2$ go through M but not through L, and some trek from $X_1$ to $Y_3$ goes through L but not through M. No choke point can exist for pairs $\{X_1, X_3\} \times \{X_2, Y_3\}$, which by the Tetrad Representation Theorem means that the tetrad $\sigma_{X_1Y_3}\sigma_{X_2X_3} = \sigma_{X_1X_2}\sigma_{Y_3X_3}$ cannot hold, contrary to our hypothesis.

Case 2: P lies between M and L in $T_{ML}$. This configuration is illustrated by Figure 10(b). As before, no choke point exists for pairs $\{X_1, X_3\} \times \{X_2, Y_3\}$, contrary to our hypothesis.

Case 3: P = L. Because all three tetrads hold in $\{X_1, X_2, X_3, Y_3\}$ and L d-separates all pairs in $\{X_1, X_2, X_3\}$, one can verify that L d-separates all pairs in $\{X_1, X_2, X_3, Y_3\}$. This will imply a $\{X_1, Y_3\} \times \{X_2, Y_2\}$ choke point, contrary to our hypothesis. □

      L1                    L2                   L3                    L4                   L5
L1    1.0
L2    0.4636804781967626    1.0
L3    0.31177237495755117   0.1445627639088577   1.0
L4    0.8241967922523632    0.6834605230188671   0.45954945371001815   1.0
L5    0.5167659523766029    0.428525239857415    0.28813447630828753   0.7617079965565864   1.0

Table 8: A counterexample that can be used to prove Lemma 7.



Lemma 7 CS3 is not sound for semilinear latent variable graphs.

Proof: In order to show this, one has only to construct a semilinear latent variable graph with a latent covariance such that it entails all constraints of CS3 but where $X_1$ and $Y_1$ have a same parent. Notice that the definition of entailment in semilinear graphs allows us to choose specific latent covariance matrices but the constraints should hold for any choice of linear coefficients and error variances.

Consider the graph G with five latent variables $L_i$, $1 \leq i \leq 5$, where $L_1$ has $X_1$ and $Y_1$ as its only children, $X_2$ is the only child of $L_2$, $X_3$ is the only child of $L_3$, $Y_2$ is the only child of $L_4$ and $Y_3$ is the only child of $L_5$. Also, $\{X_1, X_2, X_3, Y_1, Y_2, Y_3\}$, as defined in CS3, are the only observed variables, and each observed variable has only one parent besides its error term. Error variables are independent.

The following simple randomized algorithm will choose a covariance matrix $\Sigma_L$ for $\{L_1, L_2, L_3, L_4, L_5\}$ that entails CS3. The symbol $\sigma_{ij}$ will denote the covariance of $L_i$ and $L_j$.

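1. Choose positive random values for all $\sigma_{ii}$, $1 \leq i \leq 5$

2. Choose random values for $\sigma_{12}$ and $\sigma_{13}$

3. $\sigma_{23} \leftarrow \sigma_{12}\sigma_{13}/\sigma_{11}$

4. Choose random values for $\sigma_{45}$, $\sigma_{25}$ and $\sigma_{24}$

5. $\sigma_{14} \leftarrow \sigma_{12}\sigma_{45}/\sigma_{25}$

6. $\sigma_{15} \leftarrow \sigma_{12}\sigma_{45}/\sigma_{24}$

7. $\sigma_{35} \leftarrow \sigma_{13}\sigma_{45}/\sigma_{14}$

8. $\sigma_{34} \leftarrow \sigma_{13}\sigma_{45}/\sigma_{15}$

9. Repeat from the beginning if $\Sigma_L$ is not positive definite or if $\sigma_{14}\sigma_{23} = \sigma_{12}\sigma_{34}$

Table 8 provides an example of such matrix. □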
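The randomized procedure can be sketched as a rejection sampler. This is a minimal illustration, not the authors' implementation: the function names, the sampling ranges and the tolerances are assumptions, and steps 5 to 8 are read as $\sigma_{14} \leftarrow \sigma_{12}\sigma_{45}/\sigma_{25}$, $\sigma_{15} \leftarrow \sigma_{12}\sigma_{45}/\sigma_{24}$, $\sigma_{35} \leftarrow \sigma_{13}\sigma_{45}/\sigma_{14}$ and $\sigma_{34} \leftarrow \sigma_{13}\sigma_{45}/\sigma_{15}$, a reading chosen because it agrees with the entries of Table 8. The check function tests only the latent-level equalities these steps impose, together with the inequality of step 9.

```python
import math
import random

def is_positive_definite(m):
    """Cholesky-based test for a symmetric matrix given as a list of rows."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:          # non-positive pivot
                    return False
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True

def latent_identities_hold(s, tol=1e-6):
    """Equalities imposed by steps 3 and 5-8, plus the inequality of step 9."""
    equalities = [
        (s[2, 3] * s[1, 1], s[1, 2] * s[1, 3]),  # step 3
        (s[1, 4] * s[2, 5], s[1, 2] * s[4, 5]),  # step 5
        (s[1, 5] * s[2, 4], s[1, 2] * s[4, 5]),  # step 6
        (s[3, 5] * s[1, 4], s[1, 3] * s[4, 5]),  # step 7
        (s[3, 4] * s[1, 5], s[1, 3] * s[4, 5]),  # step 8
    ]
    if any(abs(lhs - rhs) > tol for lhs, rhs in equalities):
        return False
    return abs(s[1, 4] * s[2, 3] - s[1, 2] * s[3, 4]) > tol

def as_matrix(s):
    """Expand the upper-triangular dictionary into a full symmetric matrix."""
    return [[s[min(i, j), max(i, j)] for j in range(1, 6)] for i in range(1, 6)]

def sample_latent_cov(rng, max_tries=100000):
    """Rejection sampler following steps 1-9; the ranges are arbitrary choices."""
    for _ in range(max_tries):
        s = {(i, i): rng.uniform(1.0, 2.0) for i in range(1, 6)}  # step 1
        s[1, 2] = rng.uniform(0.2, 0.6)                           # step 2
        s[1, 3] = rng.uniform(0.2, 0.6)
        s[2, 3] = s[1, 2] * s[1, 3] / s[1, 1]                     # step 3
        for pair in [(4, 5), (2, 5), (2, 4)]:                     # step 4
            s[pair] = rng.uniform(0.2, 0.6)
        s[1, 4] = s[1, 2] * s[4, 5] / s[2, 5]                     # step 5
        s[1, 5] = s[1, 2] * s[4, 5] / s[2, 4]                     # step 6
        s[3, 5] = s[1, 3] * s[4, 5] / s[1, 4]                     # step 7
        s[3, 4] = s[1, 3] * s[4, 5] / s[1, 5]                     # step 8
        # Step 9: reject and retry unless both conditions are met.
        if is_positive_definite(as_matrix(s)) and latent_identities_hold(s):
            return s
    raise RuntimeError("no admissible covariance matrix found")

# Entries of Table 8, keyed by (row, column) with row <= column.
TABLE_8 = {
    (1, 1): 1.0, (1, 2): 0.4636804781967626, (1, 3): 0.31177237495755117,
    (1, 4): 0.8241967922523632, (1, 5): 0.5167659523766029,
    (2, 2): 1.0, (2, 3): 0.1445627639088577, (2, 4): 0.6834605230188671,
    (2, 5): 0.428525239857415,
    (3, 3): 1.0, (3, 4): 0.45954945371001815, (3, 5): 0.28813447630828753,
    (4, 4): 1.0, (4, 5): 0.7617079965565864,
    (5, 5): 1.0,
}

sampled = sample_latent_cov(random.Random(0))
print(latent_identities_hold(TABLE_8), latent_identities_hold(sampled))
```

Restricting the random covariances to positive values keeps the rejection rate low; the procedure itself does not require it.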
Theorem 2 There is some $\Sigma_L$ such that LTM(E) and STM(E, $\Sigma_L$) are not equal.

Proof:  Follows  immediately  from  Lemmas  6  and  7.  □ 

Before  proving  Theorem  3,  we  will  introduce  several  lemmas  that  will  be  used  in  the 
Theorem  proof. 

Lemma 8 Let G(O) be a latent variable graph where no pair in O is marginally uncorrelated, and let $\{X, Y\} \subset O$. If there is no pair $\{P, Q\} \subset O$ such that $\sigma_{XY}\sigma_{PQ} = \sigma_{XP}\sigma_{YQ}$ holds, then there is at least one graph in the tetrad equivalence class of G where X and Y have a common latent parent.

Proof: It will suffice to show the result for linear latent variable models, since they are more constrained than non-linear ones. Moreover, we will be able to make use of the Tetrad Representation Theorem and the equivalence of d-separations and vanishing partial correlations, facilitating the proof.

If in all graphs in the tetrad equivalence class of G we have that X and Y share some common hidden parent, then we are done. Assume then that there is at least one graph $G_0$ in this class such that X and Y have no common hidden parent. Construct graph $G'_0$ by adding a new latent L and edges $X \leftarrow L \rightarrow Y$. We will show that $G'_0$ is in the same tetrad equivalence class, i.e., the addition of the substructure $X \leftarrow L \rightarrow Y$ to $G_0$ does not destroy any entailed tetrad constraint (it might, however, destroy some independence constraint).

Assume there is a tetrad constraint corresponding to some choke point $\{X, P\} \times \{T, Q\}$. If Y is not an ancestor of T or Q, then this tetrad will not be destroyed by the introduction of subpath $X \leftarrow L \rightarrow Y$, since no new treks connecting X or P to T or Q can be formed, and therefore no choke point $\{X, P\} \times \{T, Q\}$ will disappear.

Assume without loss of generality that Y is an ancestor of Q. Since there is a trek connecting X to Q through Y (because no marginal correlations are zero) in G, the choke point $\{X, P\} \times \{T, Q\}$ should be in this trek. Let X be the starting node of this trek, and Q the ending node. If the choke point is after Y on this trek, then this choke point will be preserved under the addition of $X \leftarrow L \rightarrow Y$. If the choke point is Y or is before Y on this trek, then there will be a choke point $\{X, P\} \times \{Y, Q\}$, a contradiction of the assumptions.

One  can  show  that  choke  points  {Y,  P}  x  {T,  Q}  are  also  preserved  by  an  analogous 
argument.  □ 


Lemma 9 Let G(O) be a linear latent variable graph, and let $O' = \{A, B, C, D\} \subset O$. If all elements in O' are marginally correlated, and a choke point $CP = \{A, C\} \times \{B, D\}$ exists, and CP is in all treks connecting elements in {A, B, C, D}, then no two elements $\{X_1, X_2\}$, $X_1 \in \{A, C\}$, $X_2 \in \{B, D\}$, are both connected to CP in G by treks into CP.

Proof: By the Tetrad Representation Theorem, CP should be either on the {A, C} or the
{B, D} side of every trek connecting elements in these two sets. For the sake of contradiction,
assume without loss of generality that A and B are connected to CP by some treks into CP.
Since $\sigma_{AB} \neq 0$, CP has to be an ancestor of either A or B. Without loss of generality, let
CP be an ancestor of B. Then there is at least one trek connecting A and B such that CP
is not on the {A, C} side of it: the one connecting CP and A that is into CP and continues
into B.

If CP is an ancestor of C, then there is at least one trek connecting C and B such that
CP is not on the {B, D} side of it: the one connecting CP and B that is into CP and
continues into C. But this cannot happen by the definition of choke point. If CP is not
an ancestor of C, CP has to be an ancestor of A, or otherwise there would be no treks
connecting A and C (since CP is in all treks connecting A and C by hypothesis, and at
least one exists, because $\sigma_{AC} \neq 0$). This implies at least one trek connecting A and B such
that CP is not on the {B, D} side of it: the one connecting CP and B that is into CP and
continues into A. Contradiction. □


Lemma 10 Let G(O) be a linear latent variable graph, and let O' = {A, B, C, D, E} ⊂
O. If all elements in O' are marginally correlated, and constraints $\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$,
$\sigma_{AC}\sigma_{DE} = \sigma_{AE}\sigma_{CD}$ and $\sigma_{BC}\sigma_{DE} = \sigma_{BD}\sigma_{CE}$ hold, then all three tetrad constraints hold in
the covariance matrix of {A, B, C, D}.

Proof: By the Tetrad Representation Theorem, let CP1 be a choke point {A, C} x {B, D},
which is known to exist in G by assumption. Let CP2 be a choke point {A, D} x {C, E},
which is also assumed to exist. From the definition of choke point, all treks connecting C
and D have to pass through both CP1 and CP2. We will assume without loss of generality
that none of the choke points we introduce in this proof are elements of {A, B, C, D, E}.

First, we will show by contradiction that all treks connecting A to C should include CP1.
Assume that A is connected to C through a trek T that includes CP2 but not CP1. Let T1
be the subtrek A -- CP2, i.e., the subtrek of T connecting A and CP2. Let T2 be the subtrek
CP2 -- C. Neither T1 nor T2 contains CP1, and they should not collide at CP2 by definition.
Notice that a trek like T should exist, since CP2 has to be in all treks connecting A and
C, and at least one such trek exists because $\sigma_{AC} \neq 0$. Any subtrek connecting CP2 to D
that does not intersect T2 elsewhere but in CP2 has to contain CP1. Let T3 be the subtrek
between CP2 and CP1. Let T4 be a subtrek between CP1 and B. Let T5 be the subtrek
between CP1 and D. This is illustrated by Figure 11(a). (B and D might be connected by
other treks, symbolized by the dashed edge.)

Now consider the choke point CP3 = {B, E} x {C, D}. Since CP3 is in all treks connecting
B and C, CP3 should be either on T2, T3 or T4. If CP3 is on T4 (Figure 11(b)), then there will
be a trek connecting D and E that does not include CP2, which contradicts the definition
of choke point {A, D} x {C, E}, unless both B -- CP1 and D -- CP1 are into CP1. However,
if both B -- CP1 and D -- CP1 (i.e., T4 and T5) are into CP1, then CP1 -- CP2 is out of CP1
and into CP2, since T2 -- T3 -- T5 is a trek by construction, and therefore cannot contain a
collider. Since D is an ancestor of CP2 and CP2 is in a trek connecting E and D, then CP2 is
an ancestor of E. All paths CP2 -> ... -> E should include CP3 by definition, which implies
that CP2 is an ancestor of CP3. B cannot be an ancestor of CP3, or otherwise CP3 would have
to be an ancestor of CP1, creating the cycle CP3 -> ... -> CP1 -> ... -> CP2 -> ... -> CP3.
CP3 would have to be an ancestor of B, since B -- CP3 -- CP1 is assumed to be a trek into
CP1 and CP3 is not an ancestor of CP1 (Figure 11(c)). If CP3 is an ancestor of B, then
there is a trek C <- ... <- CP2 -> ... -> CP3 -> B, which does not include CP1. Therefore,
CP3 is not in T4.

Figure 11: Several illustrations depicting cases used in the proof of Lemma 10.

If CP3 is in T3, B and D should both be ancestors of CP1, or otherwise there will be a trek
connecting them that does not include CP3. Again, this will imply that CP1 is an ancestor
of CP2. If some trek E -- CP3 is not into CP3, then this creates a trek D -- CP1 -- CP3 -- E
that does not contain CP2, contrary to our hypothesis. If every trek E -- CP3 is into CP3,
then some other trek CP3 -- D that is out of CP3 but does not include CP1 has to exist. But
then this creates a trek connecting C and D that does not include CP1, which contradicts
the definition CP1 = {A, C} x {B, D}. A similar reasoning forbids the placement of CP3 in
T2.

Therefore, all treks connecting A and C should include CP1. We will now show that
all treks connecting B and D should also include CP1. We know that all treks connecting
elements in {A, C, D} go through CP1. We also know that all treks between {B, E} and
{C, D} go through CP3. This is illustrated by Figure 11(d). A possible trek from CP3 to D
that does not include CP1 (represented by the dashed edge connecting CP3 and D) would
still have to include CP2, since all treks in {A, D} x {C, E} go through CP2. If CP1 = CP2,
then all treks between B and D go through CP1. If CP1 ≠ CP2, then such CP3 -- D trek
without CP1 but with CP2 would exist, implying that some trek C -- D without both CP1
and CP2 would exist, contrary to our hypothesis.

Therefore, we showed that all treks connecting elements in {A, B, C, D} go through the
same point CP1. By symmetry between B and E, it is also the case that CP1 is in all treks
connecting elements in {A, E, C, D}. From this one can verify that CP1 = CP2. We will
show that CP1 is also a choke point for {B, E} x {C, D} (although it might be the case that
CP1 ≠ CP3). Because CP1 = CP2, one can verify that choke point CP3 has to be in a trek
connecting B and CP1. There is a trek connecting B and CP1 that is into CP1 if and only if
there is a trek connecting B and CP3 that is into CP3. The same holds for E. Therefore, there is
a trek connecting B and CP1 that is into CP1 if and only if there is a trek connecting E and
CP1 that is into CP1. However, if there is a trek connecting B and CP1 into CP1, then there
is no trek connecting C and CP1 that is into CP1 (because of choke point {A, C} x {B, D}
and Lemma 9). This also implies there is no trek E -- CP1 into CP1, and because CP1 is
an {A, D} x {C, E} choke point, Lemma 9 will imply that there is no D -- CP1 into CP1.
Therefore, all treks connecting pairs {B, E} x {C, D} will be either on the {B, E} side or
the {C, D} side of CP1. CP1 is a {B, E} x {C, D} choke point.

Because CP1 is an {A, C} x {B, D}, {A, D} x {C, E} and {B, E} x {C, D} choke point,
no pair in {A, B, C, D} can be connected to CP1 by a trek into CP1. This implies
that CP1 d-separates all elements in {A, B, C, D} and therefore CP1 is a choke point for all
tetrads in this set. □


Lemma 11 Let G(O) be a linear latent variable graph, and let O' = {A, B, C, D, E} ⊂
O. If all elements in O' are marginally correlated, and constraints $\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$,
$\sigma_{AC}\sigma_{DE} = \sigma_{AE}\sigma_{CD}$ and $\sigma_{BE}\sigma_{DC} = \sigma_{BD}\sigma_{CE}$ hold, then all three tetrad constraints hold in
the covariance submatrix formed by any foursome in {A, B, C, D, E}.

Proof: As in Lemma 10, let CP1 be a choke point {A, C} x {B, D}, and let CP2 be a choke
point {A, D} x {C, E}. Let CP3 be a choke point {B, C} x {D, E}.

We first show that all treks between C and A go through CP1. Assume there is a trek
connecting A and C through CP2 but not CP1, analogous to Figure 11(a). Let T1, ..., T5
be defined as in Lemma 10. Since all treks between C and D go through CP3, choke point
CP3 should be either at T2, T3 or T4.

If CP3 is at T2 or T3, then treks B and D should collide at CP1, or otherwise there
will be a trek connecting B and D that does not include CP3. This implies that CP1 is an
ancestor of CP3. If there is a trek connecting D and CP3 that intersects T2 or T3 not at
CP1, then there will be a trek connecting C and D that does not include CP1, which would
be a contradiction. If there is no such trek connecting D and CP3, then CP3 cannot be a
{B, C} x {D, E} choke point. If CP3 is at T4, a similar case will follow.

Therefore, all treks connecting A and C include CP1. By symmetry between {A, B, E}
and {C, D}, CP1 is in all treks connecting any pair in {A, B, C, D, E}. Using the same
arguments of Lemma 10, one can show that CP1 is a choke point for any foursome in this set. □


Lemma 12 Let G(O) be a linear latent variable graph, and let O' = {A, B, C, D, E} ⊂
O. If all elements in O' are marginally correlated, and constraints $\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$,
$\sigma_{AC}\sigma_{DE} = \sigma_{AE}\sigma_{CD}$ and $\sigma_{AB}\sigma_{CE} = \sigma_{AC}\sigma_{BE}$ hold, then all three tetrad constraints hold in
the covariance matrix of {A, C, D, E}.

Proof: As in Lemmas 10 and 11, let CP1 be a choke point {A, C} x {B, D}, and let CP2
be a choke point {A, D} x {C, E}. Let CP3 be a choke point {A, E} x {B, C}. We will first
show that either all treks connecting A and C go through CP1 or all treks connecting A and
D go through CP2.

As in Lemma 10, all treks connecting C and D contain CP1 and CP2. Let T be one of
these treks. Assuming that A and C are connected by some trek that does not contain CP1
(but must contain CP2) implies a family of graphs represented by Figure 11(a).

Since there is a choke point CP3 = {A, E} x {B, C}, the only possible position for CP3
in Figure 11(a) is in trek A -- CP2. If CP2 ≠ CP3, then no choke point {A, D} x {C, E} can
exist, since CP3 is not in T. Therefore, either all treks between A and C contain CP1, or
CP2 = CP3.

If the first case holds, a similar argument will show that all treks between any element
in {A, C, D} and node E will have to go through CP1. If the second case holds, a similar
argument will show that all treks between any element in {A, C, D} and node E will have
to go through CP2.

Therefore, there is a node CP such that all treks connecting elements in {A, C, D, E} go
through CP. Similarly to the proof of Lemma 10, using Lemma 9, the given
tetrad constraints will imply that CP is a choke point for all tetrads in {A, C, D, E} for both
cases CP = CP1 and CP = CP2. □

Theorem 3 There is no locally sound tetrad constraint set of domain size less than 6 for
deciding if two nodes A and B do not have a common parent in a latent variable graph G, if
$\rho_{X_1 X_2 . X_3} \neq 0$ and $\rho_{X_1 X_2} \neq 0$ for {X1, X2} in the domain of the constraint set and observed
variable X3.

Proof: It will suffice to show the result for linear latent variable models, since they are more
constrained than non-linear ones. Moreover, we will be able to make use of the Tetrad
Representation Theorem and the equivalence of d-separations and vanishing partial correlations,
facilitating the proof.

This is trivial for domains of size 2 and 3, where no tetrad constraint can hold. For
domains of size 4, let X = {A, B, C, D} be our four variables. We will show that, no
matter which tetrad constraints hold among these four variables (excluding logically
inconsistent constraints), there exist two linear latent variable graphs with observable variables
{A, B, C, D}, G' and G", where in the former A and B do not share a parent, while in the latter
they do have a parent in common. This will be the main technique used during the entire
proof. Another technique is showing that some combinations of tetrad constraints will result
in contradictory assumptions about existing constraints, and therefore we do not need to
create the G' and G" graphs corresponding to these sets.

By Lemma 8, if we do not have any tetrad corresponding to a choke point {A, V1} x
{B, V2}, then the result follows immediately. We therefore consider only the cases where
the tetrad constraint corresponding to choke point {A, C} x {B, D} exists, without loss of
generality. This assumption will be used during the entire proof.

Bi-directed edges X <-> Y will be used as a shorthand representation for the path X <-
L -> Y, where L is some new latent independent of its non-children.

Suppose first that all possible three tetrad constraints hold in the covariance matrix Σ of
{A, B, C, D}, i.e., $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$. Let G' have two latent nodes L1 and L2,
where L1 is a common parent of A and L2, and L2 a parent of B, C and D. Let G" have a
latent node L1 as the only parent of A, B, C and D, and no other edges, and the result will
follow for this case.
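
The two graphs of this case can be checked numerically. The following is a minimal sketch, not part of the original proof: the edge coefficients and the unit error variances are arbitrary assumptions, and the helper names `implied_cov` and `tetrad_differences` are hypothetical. It computes the covariance matrix implied by each linear graph and evaluates the three tetrad differences among A, B, C and D.

```python
def implied_cov(order, parents, noise):
    """Covariance implied by a linear DAG with independent error terms.

    order   -- node names in topological order (parents before children)
    parents -- dict: node -> {parent: edge coefficient}
    noise   -- dict: node -> variance of the node's error term
    """
    cov = {}
    for i, v in enumerate(order):
        pa = parents.get(v, {})
        # Covariance with each earlier node follows from v's structural equation.
        for u in order[:i]:
            cov[(v, u)] = cov[(u, v)] = sum(b * cov[(p, u)] for p, b in pa.items())
        cov[(v, v)] = noise[v] + sum(b * cov[(v, p)] for p, b in pa.items())
    return cov


def tetrad_differences(cov, a, b, c, d):
    """Return the three tetrad differences among four variables."""
    def s(x, y):
        return cov[(x, y)]
    return (
        s(a, b) * s(c, d) - s(a, c) * s(b, d),
        s(a, c) * s(b, d) - s(a, d) * s(b, c),
        s(a, d) * s(b, c) - s(a, b) * s(c, d),
    )


def unit_noise(order):
    # Assumption for illustration: every error term has variance 1.
    return {v: 1.0 for v in order}


# G': L1 -> {A, L2}, L2 -> {B, C, D}; A and B share no parent.
g1_order = ["L1", "L2", "A", "B", "C", "D"]
g1_parents = {
    "L2": {"L1": 0.8},
    "A": {"L1": 1.5},
    "B": {"L2": 0.7},
    "C": {"L2": -1.2},
    "D": {"L2": 2.0},
}

# G": a single latent L1 is the only parent of A, B, C and D.
g2_order = ["L1", "A", "B", "C", "D"]
g2_parents = {v: {"L1": w} for v, w in zip("ABCD", (1.5, 0.7, -1.2, 2.0))}

t1 = tetrad_differences(implied_cov(g1_order, g1_parents, unit_noise(g1_order)),
                        "A", "B", "C", "D")
t2 = tetrad_differences(implied_cov(g2_order, g2_parents, unit_noise(g2_order)),
                        "A", "B", "C", "D")
print(t1)
print(t2)
```

Both graphs give tetrad differences that vanish up to floating-point error, so no tetrad constraint over these four variables can tell whether A and B share a parent.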

Suppose now only one tetrad constraint holds instead of all three, i.e., the one entailed
by a choke point between pairs {A, C} x {B, D} (the analogous case would be the pairs
{A, D} x {B, C}). Create G' again by using two latents L1 and L2, making L2 a parent of
B and D, and making L1 a parent of L2, A and C. Create G" from G', by adding the edge
L1 -> B.
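
As a sanity check of this construction, the sketch below compares the tetrad differences implied by G' and G". The coefficients, the unit error variances and the helper names (`implied_cov`, `tetrads`) are assumptions made only for illustration. In both graphs only the constraint associated with the choke point {A, C} x {B, D} vanishes.

```python
def implied_cov(order, parents, noise):
    """Covariance implied by a linear DAG with independent error terms.

    order   -- node names in topological order (parents before children)
    parents -- dict: node -> {parent: edge coefficient}
    noise   -- dict: node -> variance of the node's error term
    """
    cov = {}
    for i, v in enumerate(order):
        pa = parents.get(v, {})
        # Covariance with each earlier node follows from v's structural equation.
        for u in order[:i]:
            cov[(v, u)] = cov[(u, v)] = sum(b * cov[(p, u)] for p, b in pa.items())
        cov[(v, v)] = noise[v] + sum(b * cov[(v, p)] for p, b in pa.items())
    return cov


def tetrads(cov):
    """Tetrad differences among A, B, C, D, keyed by the associated choke point."""
    def s(x, y):
        return cov[(x, y)]
    return {
        "{A,C}x{B,D}": s("A", "B") * s("C", "D") - s("A", "D") * s("B", "C"),
        "{A,B}x{C,D}": s("A", "C") * s("B", "D") - s("A", "D") * s("B", "C"),
        "{A,D}x{B,C}": s("A", "B") * s("C", "D") - s("A", "C") * s("B", "D"),
    }


order = ["L1", "L2", "A", "B", "C", "D"]
noise = {v: 1.0 for v in order}  # assumed unit error variances

# G': L1 -> {L2, A, C}, L2 -> {B, D}.
g_prime = {
    "L2": {"L1": 0.8},
    "A": {"L1": 1.5},
    "C": {"L1": -1.2},
    "B": {"L2": 0.7},
    "D": {"L2": 2.0},
}

# G": G' plus the edge L1 -> B, so that A and B now share the parent L1.
g_double = {k: dict(v) for k, v in g_prime.items()}
g_double["B"]["L1"] = 0.5

t_prime = tetrads(implied_cov(order, g_prime, noise))
t_double = tetrads(implied_cov(order, g_double, noise))
for name in t_prime:
    print(name, round(t_prime[name], 6), round(t_double[name], 6))
```

Because the same single constraint vanishes in both graphs, the pattern of tetrad constraints alone does not reveal whether A and B share a parent.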

Now suppose our domain X = {A, B, C, D, E} has five variables, where Σ will now
denote the covariance matrix of X. Again, we will show how to build graphs G' and G" in
all possible consistent combinations of vanishing and non-vanishing tetrad constraints. This
case is more complicated, and we will divide it in several major subcases. Each subcase
will have a sub-index, and each sub-index inherits the assumptions of higher-level indices.
Some results about entailment of tetrad constraints are stated without explicit detail: they
can be derived directly by a couple of algebraic manipulations of tetrad constraints or from
Lemmas 10, 11 and 12.

Case 1: There are choke points {A, C} x {B, D} and {A, B} x {C, D}. We know from
the assumption of existence of a choke point {A, C} x {B, D} and results from Chapter 3
that this is equivalent to having a latent variable d-separating all elements in {A, B, C, D}.
Let G0 be as follows: let L1 and L2 be two latent variables, let L1 be a parent of {A, L2},
and let L2 be a parent of {B, C, D, E}. We will construct G' and G" from G0, considering
all possible combinations of choke points of the form {V1, V2} x {V3, E}.

Case 1.1: there is a choke point {A, C} x {D, E}.

Case 1.1.1: there is a choke point {A, D} x {C, E}. As before, this implies a choke
point {A, E} x {C, D}. We only have to consider now choke points of the form
{X1, B} x {X2, E} and {X1, X2} x {B, E}. From the given constraints $\sigma_{BD}\sigma_{AC} = \sigma_{BC}\sigma_{AD}$
(choke point {A, B} x {C, D}) and $\sigma_{DE}\sigma_{AC} = \sigma_{CE}\sigma_{AD}$ (choke point {A, E} x {C, D}), we
have $\sigma_{BD}\sigma_{CE} = \sigma_{BC}\sigma_{DE}$, a {B, E} x {C, D} choke point. Choke points {B, E} x {A, C} and
{B, E} x {A, D} will follow from this conclusion. Finally, if we assume also the existence of
some choke point {X1, B} x {X2, E}, then all choke points of this form will exist, and one
can let G' = G0. Otherwise, if there is no choke point {X1, B} x {X2, E}, let G' be G0 with
the added edge B <-> E. Construct G" by adding edge L2 -> A to G'.
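
The entailment of the {B, E} x {C, D} constraint in Case 1.1.1 can be spelled out by dividing the two given constraints. The step below is a sketch of that algebra; it assumes all covariances involved are nonzero, which holds here because all variables are marginally correlated.

```latex
\begin{align*}
\sigma_{BD}\sigma_{AC} = \sigma_{BC}\sigma_{AD}
  &\;\Longrightarrow\; \frac{\sigma_{AC}}{\sigma_{AD}} = \frac{\sigma_{BC}}{\sigma_{BD}}, \\
\sigma_{DE}\sigma_{AC} = \sigma_{CE}\sigma_{AD}
  &\;\Longrightarrow\; \frac{\sigma_{AC}}{\sigma_{AD}} = \frac{\sigma_{CE}}{\sigma_{DE}}, \\
\text{hence}\quad \frac{\sigma_{BC}}{\sigma_{BD}} = \frac{\sigma_{CE}}{\sigma_{DE}}
  &\;\Longrightarrow\; \sigma_{BD}\sigma_{CE} = \sigma_{BC}\sigma_{DE}.
\end{align*}
```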

Case 1.1.2: there is no choke point {A, D} x {C, E}. Choke point {A, E} x {C, D} cannot
exist, or this will imply {A, D} x {C, E}. We only have to consider now choke
points of the form {X1, B} x {X2, E} and {X1, X2} x {B, E}. Choke point {A, C} x
{B, E} is entailed to exist, since the single choke point that d-separates foursome {A, B, C, D}
has to be the same choke point for {A, C} x {D, E} and therefore a choke point for {A, C} x
{B, E}. No choke point {X1, D} x {X2, E} can exist, for Xi ∈ {A, B, C}, i = 1, 2: otherwise,
from the given choke points and {X1, D} x {X2, E}, one can verify that {A, D} x {C, E}
would be generated using combinations of tetrad constraints. We only have to consider
now choke points of the form {X1, B} x {X2, E}. Choke points {B, C} x {A, E},
{B, C} x {D, E}, {A, B} x {C, E} and {A, B} x {D, E} either all exist or none exists. If
all exist, let G' = G0 with the extra edge D <-> E. If none exists, let G' = G0 and add both
B <-> E and D <-> E to G'. Let G" be G' with the extra edge L2 -> A.

Case 1.2: there is no choke point {A, C} x {D, E}.

Case 1.2.1: there is a choke point {A, D} x {C, E}. This case is analogous to Case 1.1.2
by symmetry within {A, B, C, D}.

Case 1.2.2: there is no choke point {A, D} x {C, E}. Assume first there is no choke
point {A, E} x {C, D}. We only have to consider now choke points of the form
{X1, B} x {X2, E} and {X1, X2} x {B, E}. At most one of the choke points {X1, B} x
{X2, E} can exist. Otherwise, any two of them will entail either {A, D} x {C, E}, {A, C} x
{D, E} or {A, E} x {C, D} by Lemmas 10, 11 or 12. Analogously, no choke point {X1, X2} x
{B, E} can exist.

Without loss of generality, let {A, B} x {D, E} be the only possible extra choke point.
Create G' by adding edges C <-> E and D <-> E to G0. Create G" by adding edge L2 -> A
to G'. For the case where no other choke point exists, create G' by adding edges A <-> E,
B <-> E, C <-> E and D <-> E to G0. Create G" by adding edge L2 -> A to G'.

Assume now there is a choke point {A, E} x {C, D}. We only have to consider now
choke  points  of  the  form  {Ad,  B}  x  {Ad,  A}  and  {Ad,  Ad}  x  {A,  A}.  No  {A,  A}  x  {Ad,  A} 
choke  point  can  exist,  or  by  Lemmas  10,  11  or  12  and  the  given  tetrad  constraints,  some 
{A,  Ad}  x  {A,  Ad}  choke  point  will  be  entailed. 

Choke  point  {A,  C}  x  {A,  A}  exists  if  and  only  if  {A,  A}  x  {C,  A}  exists,  can  exist. 
If  both  exist,  create  G'  by  adding  edges  A  A  to  Go-  Create  G"  by  adding  edge  L2  — > ►  A 
to  G".  If  none  exists,  create  G'  by  adding  edges  A  <->•  A  and  A  <->•  A  to  Go-  Create  G"  by 
adding  edge  L2  — >  A  to  G'. 

Case 2: There is a choke point {A, C} x {B, D}, but no choke point {A, B} x {C, D}.

Case 2.1: there is a choke point {A, C} x {D, E}.

Case 2.1.1: there is a choke point {A, D} x {C, E}. As before, this implies a choke
point {A, E} x {C, D}. We only have to consider now choke points of the form
{X1, B} x {X2, E} and {X1, X2} x {B, E}. The choke point {A, C} x {B, E} is implied.
No  choke  point  {A,  A}  x  {Ad,  A}  can  exist,  or  otherwise  {A,  A}  x  {G,  A}  will  be  implied.  For 
the  same  reason,  no  choke  point  {A,  Ad}  x  {A,  A}  can  exist.  We  only  have  to  consider 
now  subsets  of  the  set  of  constraints  {{A,  A}  x  {G,  A},  {G,  A}  x  {A,  A}}.  The  existence 
of  {A,  A}  x  {G,  A}  implies  {G,  A}  x  {A,  A}.  We  only  need  to  consider  either  both  or  none. 

Suppose  none  of  these  two  constraints  hold.  Create  G'  with  two  latents  Li,L2.  Let 
Li  be  a  parent  of  {A,L2},  let  L2  be  a  parent  of  {A,  G,  A,  A}.  Add  the  bi-directed  edge 
A  A.  Add  the  bi-directed  edge  A  A.  Create  G"  out  of  G'  by  adding  edge  L2  — >  A. 
Now  suppose  both  constraints  hold.  Create  G'  with  two  latents  L±,L2.  Let  L\  be  a  parent 
of  {A,  L2},  let  L2  be  a  parent  of  {A,  G,  A,  A}.  Add  the  bi-directed  edge  A  <-»•  A.  Create  G" 
out  of  G'  by  adding  edge  L2  — >  A. 

Case 2.1.2: there is no choke point {A, D} x {C, E}. Since there is a choke point
{A, C} x {D, E} by assumption 2.1, there is no choke point {A, E} x {C, D} or
otherwise we get a contradiction. Analogously, because there is an {A, C} x {B, D} choke point
but no {A, B} x {C, D} (assumption 2), we cannot have an {A, D} x {B, C} choke point.
This covers all choke points within sets {A, B, C, D} and {A, C, D, E}. We only have to
consider now choke points of the form {X1, B} x {X2, E} and {X1, X2} x {B, E}.
From $\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$ (choke point {A, C} x {B, D}) and $\sigma_{AE}\sigma_{CD} = \sigma_{AD}\sigma_{CE}$
(choke point {A, C} x {D, E}) one gets $\sigma_{AB}\sigma_{CE} = \sigma_{AE}\sigma_{BC}$, i.e., a {B, E} x {A, C} choke
point. Choke point {B, E} x {A, D} exists if and only if {B, E} x {C, D} exists: to see how
the former implies the latter, use the tetrad constraint from {B, E} x {A, C}. Therefore, we
have two subcases.
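
The step from the two given constraints to the {B, E} x {A, C} constraint in Case 2.1.2 can be written out as follows. This is a sketch of the algebra, assuming the covariances used as divisors are nonzero, which follows from all variables being marginally correlated.

```latex
\begin{align*}
\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}
  &\;\Longrightarrow\; \frac{\sigma_{AB}}{\sigma_{BC}} = \frac{\sigma_{AD}}{\sigma_{CD}}, \\
\sigma_{AE}\sigma_{CD} = \sigma_{AD}\sigma_{CE}
  &\;\Longrightarrow\; \frac{\sigma_{AE}}{\sigma_{CE}} = \frac{\sigma_{AD}}{\sigma_{CD}}, \\
\text{hence}\quad \frac{\sigma_{AB}}{\sigma_{BC}} = \frac{\sigma_{AE}}{\sigma_{CE}}
  &\;\Longrightarrow\; \sigma_{AB}\sigma_{CE} = \sigma_{AE}\sigma_{BC}.
\end{align*}
```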

Case 2.1.2.1: there are choke points {B, E} x {A, D} and {B, E} x {C, D}. We
only have to consider now choke points of the form {X1, B} x {X2, E}. No choke point
{B, A} x {C, E} and {B, C} x {A, E} can exist (one implies the other, since we have {B, E} x
{A, C}, and all three together with the given choke points will generate {A, B} x {C, D},
excluded by assumption). Choke points {B, C} x {D, E} and {B, D} x {C, E} either both
exist or both do not exist. The same holds for pair {{B, A} x {D, E}, {B, D} x {A, E}}. Let
G' be a graph with two latents, L1, L2, where L1 is a parent of {L2, A, C} and L2 is a parent of
{B, D, E}. Add bi-directed edge B <-> D for cases where {B, C} x {D, E}, {B, D} x {C, E}
do not exist. Add bi-directed edge B <-> E for cases where {B, A} x {D, E}, {B, D} x {A, E}
do not exist. Let G" be formed from G' with the addition of L1 -> B.

Case 2.1.2.2: there are no choke points {B, E} x {A, D} and {B, E} x {C, D}. We
only have to consider now choke points of the form {X1, B} x {X2, E}. Using the
tetrad constraint implied by choke point {A, C} x {D, E}, one can verify that {A, B} x {D, E}
holds if and only if {B, C} x {D, E} holds (call pair {{A, B} x {D, E}, {B, C} x {D, E}}
Pair 1). From the given {B, E} x {A, C}, we have that {A, B} x {C, E} holds if and only
if {B, C} x {A, E} holds (call it Pair 2). Using the given tetrad constraint corresponding to
{A, C} x {B, D}, one can show that {B, D} x {A, E} holds if and only if {B, D} x {C, E}
holds (call it Pair 3). We can therefore partition all six possible {X1, B} x {X2, E} into these three
pairs. Moreover, if Pair 1 holds, none of the other two can hold, because Pair 1 and Pair 2
together imply {B, E} x {A, D}. Pair 1 and Pair 3 together imply {B, E} x {C, D}.

If none of the pairs holds, construct G' as follows. Let G0 be the latent variable graph
containing three latents L1, L2, L3 where L1 is a parent of {A, C, L2}, L2 is a parent of
{B, L3} and L3 is a parent of {D, E}. Let G' be G0 with the added edges B <-> D and
B <-> E. If Pair 1 alone holds, let G' be as G0. In both cases, let G" be G' with the added
edge L1 -> B.

If Pair 2 holds, but not Pair 3 (nor Pair 1), construct G' as follows. Let G0 be a
latent variable graph with two latents L1 and L2, where L1 is a parent of L2 and A, and L2
is a parent of {B, C, D, E}. Let G' be G0 augmented with edges B <-> D and B <-> E. If Pairs
2 and 3 hold (but not Pair 1), let G' be G0 with the extra edge B <-> D. In both cases, let
G" be G' with the extra edge L2 -> A. If Pair 3 holds but not Pair 2 (nor Pair 1), let G'
have three latents L1, L2, L3, where L1 is a parent of L2 and A, L2 is a parent of L3 and C,
and L3 is a parent of B, D and E. Let G" be as G' but with the extra edge L3 -> L1.

Case 2.2: there is no choke point {A, C} x {D, E}.

Case 2.2.1: there is a choke point {A, D} x {C, E}. Because of the choke points that
are assumed not to exist, it follows immediately that choke points {A, D} x {B, C}, {A, E} x
{C, D} cannot exist. We only have to consider now choke points of the form
{X1, B} x {X2, E} and {X1, X2} x {B, E}. The choke point {A, D} x {B, E} cannot
exist, or otherwise when it is combined with choke point {A, D} x {C, E}, it will generate
a constraint corresponding to choke point {A, D} x {B, C}, which is assumed not to exist.
Similarly, {A, C} x {B, E} cannot exist because the existence of {A, C} x {B, D} will imply
{A, C} x {D, E}. No choke point {B, E} x {C, D} can exist either. This follows from choke
points {A, C} x {B, D}, {A, D} x {C, E}, which with {B, E} x {C, D} entail choke point
{A, B} x {C, D} (Lemma 10), which is assumed not to exist.

We only have to consider now choke points of the form {X1, B} x {X2, E}.
Choke points {B, C} x {D, E} and {B, D} x {C, E} are automatically excluded because of
{A, C} x {B, D}, {A, D} x {C, E} and Lemma 11. Combining choke point {A, B} x {C, E}
with choke point {A, D} x {C, E} will generate a choke point {B, D} x {C, E}, which we
just discarded. Therefore, there is no choke point {A, B} x {C, E}. Combining choke point
{B, D} x {A, E} with choke point {A, C} x {B, D} will generate a choke point {B, D} x
{C, E}, which we just discarded. Therefore, there is no choke point {B, D} x {A, E}.
Combining choke point {B, C} x {A, E} with {A, C} x {B, D} and {A, D} x {C, E} using
Lemma 12 will result in a choke point {A, E} x {C, D}, which is discarded by hypothesis.
Therefore, there is no choke point {B, C} x {A, E}. Combining choke point {A, B} x {D, E}
with {A, C} x {B, D} and {A, D} x {C, E} using Lemma 12 will result in a choke point
{A, B} x {C, D}, which is discarded by hypothesis. Therefore, there is no choke point
{A, B} x {D, E}.

This means our model can entail only tetrad constraints generated by {A, C} x {B, D}
and {A, D} x {C, E}. Let G' have two latent variables L1 and L2. Make L1 the parent of
{A, C, E, L2}. Let L2 be the parent of B and D. Add the bi-directed edge B <-> E. Let G" be
G' with the added edge L2 -> A.

Case 2.2.2: there is no choke point {A, D} x {C, E}. As before, both {A, B} x {C, D}
and {A, D} x {B, C} are forbidden. We consider two possible scenarios for choke point
{A, E} x {C, D}.

Case 2.2.2.1: there is a choke point {A, E} x {C, D}. We only have to consider
now choke points of the form {X1, B} x {X2, E} and {X1, X2} x {B, E}. Choke point
{B, E} x {C, D} does not exist, because this combined with {A, E} x {C, D} will result
in {A, B} x {C, D}, excluded by assumption. {B, E} x {A, D} cannot exist either: to see
this, start from the constraint set {{A, C} x {B, D}, {A, E} x {C, D}, {B, E} x {A, D}}.
Exchanging the labels of D and E, followed by the exchange of E and C, this is equivalent
to {{A, E} x {B, C}, {A, D} x {C, E}, {B, D} x {A, C}}. From Lemma 12, the constraint
{B, D} x {E, C} is generated. Reverting the substitutions of E and C, and E and D, this is
equal to {B, E} x {C, D} in the original labeling, which was ruled out at the beginning of this
paragraph. A similar reasoning rules out {B, E} x {A, C}. We only have to consider now
choke points of the form {X1, B} x {X2, E}. Choke point {A, B} x {D, E} cannot exist.
Given the assumed choke point set {A, C} x {B, D}, {A, E} x {C, D}, {A, B} x {D, E}, by
exchanging labels A and C, one obtains {A, C} x {B, D}, {A, D} x {C, E}, {B, C} x {D, E},
which by Lemma 11 implies choke points among all elements in {A, B, C, D, E}. A similar
reasoning rules out all other choke points of the type {X1, B} x {X2, E}. Construct G' as
follows: two latents, L1 and L2, where L1 is a parent of A, C, E and L2, and L2 is a parent
of B and D. Add the bi-directed edge B <-> E. Construct G" by adding edge L1 -> B to G'.

Case 2.2.2.2: there is no choke point {A, E} x {C, D}. We only have to consider
now choke points of the form {X1, B} x {X2, E} and {X1, X2} x {B, E}. Choke
point {A, C} x {B, E} does not exist, because this combined with {A, C} x {B, D} generates
{A, C} x {D, E}. Choke points {A, D} x {B, E} and {C, D} x {B, E} cannot both exist,
since they jointly imply choke point {A, C} x {B, E}.

Assume for now that choke point {A, D} x {B, E} exists (but not {C, D} x
{B, E}). We only have to consider now choke points of the form {X1, B} x {X2, E}.
Choke point {A, B} x {C, E} cannot exist, since by exchanging A and D, B and C in set
{{A, C} x {B, D}, {A, D} x {B, E}, {A, B} x {C, E}} we get {{A, C} x {B, D}, {A, D} x
{C, E}, {B, E} x {C, D}}, which by Lemma 10 will imply all tetrad constraints with {A, B, C, D}.

The same reasoning applies to {B, C} x {A, E} (exchanging A and D, B and
C in the given tetrad constraints) by using Lemma 11. The same reasoning applies to
{B, C} x {D, E} (exchanging A and D, B and C in the given tetrad constraints) by using
Lemma 12.

Because of the assumed {A, C} x {B, D}, either both choke points {A, E} x {B, D},
{C, E} x {B, D} exist or none exists. Because of the assumed {A, D} x {B, E}, either both
choke points {A, E} x {B, D}, {A, B} x {D, E} exist or none exists. That is, either all choke
points {{A, E} x {B, D}, {A, B} x {D, E}, {C, E} x {B, D}} exist or none exist. If all exist,
create G' as follows: use two latents L1, L2, where L1 is a parent of A, C and L2, L2 is a
parent of B, D and E, and there is a bi-directed edge C <-> E. Construct G" by adding edge
L2 -> A to G'. If none of the three mentioned choke points exist, do the same but with an
extra bi-directed edge B <-> E.
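
The construction for the case where all three choke points exist can be checked numerically. The sketch below is only an illustration: it assumes that a choke point {P, Q} x {R, S} stands for the tetrad constraint sigma_PR sigma_QS = sigma_PS sigma_QR (the reading that makes the combination rules used in this proof work), and the loadings, the latent covariance `G` and the weight of the bi-directed edge are hypothetical values chosen for the example. It builds the implied covariances of G' (L1 a parent of A, C and L2; L2 a parent of B, D and E; C <-> E) and of G" (G' plus L2 -> A) and lists every tetrad constraint that vanishes.

```python
from itertools import combinations

G = 0.8  # hypothetical cov(L1, L2); both latents are given unit variance

def covariances(loadings, bidirected):
    """Pairwise covariances of the observed variables.

    loadings[X] = (weight on L1, weight on L2); bidirected maps a pair of
    names to the extra covariance contributed by a bi-directed edge.
    """
    latent = ((1.0, G), (G, 1.0))
    cov = {}
    for x, y in combinations(sorted(loadings), 2):
        s = sum(loadings[x][i] * latent[i][j] * loadings[y][j]
                for i in range(2) for j in range(2))
        cov[frozenset((x, y))] = s + bidirected.get(frozenset((x, y)), 0.0)
    return cov

def vanishing_tetrads(cov, names="ABCDE", tol=1e-9):
    """Return every {P,Q} x {R,S} with sigma_PR*sigma_QS == sigma_PS*sigma_QR."""
    found = set()
    for w, x, y, z in combinations(names, 4):
        # The three ways of splitting four variables into two pairs.
        for (p, q), (r, s) in (((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))):
            diff = (cov[frozenset((p, r))] * cov[frozenset((q, s))]
                    - cov[frozenset((p, s))] * cov[frozenset((q, r))])
            if abs(diff) < tol:
                found.add(f"{{{p},{q}}} x {{{r},{s}}}")
    return found

# G': L1 -> A, C (and L2); L2 -> B, D, E; bi-directed edge C <-> E.
g_prime = {"A": (0.9, 0.0), "C": (0.7, 0.0),
           "B": (0.0, 0.8), "D": (0.0, 0.5), "E": (0.0, 0.6)}
edge = {frozenset(("C", "E")): 0.3}
# G": the same graph plus the edge L2 -> A (hypothetical weight 0.4).
g_second = dict(g_prime, A=(0.9, 0.4))

tetrads_prime = vanishing_tetrads(covariances(g_prime, edge))
tetrads_second = vanishing_tetrads(covariances(g_second, edge))
print(sorted(tetrads_prime))
print(tetrads_prime == tetrads_second)
```

With these values the vanishing tetrads are the two assumed choke points {A, C} x {B, D} and {A, D} x {B, E} plus the three choke points of this case, and G' and G" agree on them. A single parameter choice does not establish entailment; it only shows that nothing else vanishes for this choice.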

Assume now that choke point {C, D} x {B, E} exists (but not {A, D} x
{B, E}). This is analogous to the previous case by symmetry of A and C.

Assume now that no choke point {C, D} x {B, E} or {A, D} x {B, E} exists.
We only have to consider now choke points of the form {X1, B} x {X2, E}. Let
Pair 1 be the set of choke points {{A, B} x {C, E}, {A, B} x {D, E}}. Let Pair 2 be the set
of choke points {{B, C} x {A, E}, {B, C} x {D, E}}. Let Pair 3 be the set of choke points
{{B, D} x {A, E}, {B, D} x {C, E}}. At most one element of Pair 1 can exist (or otherwise
it will entail {A, B} x {C, D}). For the same reason, at most one element of Pair 2 can exist.
Either both elements of Pair 3 exist or none exist.

If both elements of Pair 3 exist, then no element of Pair 1 or Pair 2 can exist. For
example, {B, D} x {A, E} from Pair 3 and {B, C} x {A, E} from Pair 2 together entail
{C, D} x {A, E}, discarded by hypothesis. In the case where both elements of Pair 3 exist,
construct G' as follows: let L1 and L2 be two latents, where L1 is a parent of A, C and L2,
and L2 is a parent of B, D and E. Add bi-directed edges A <-> E and C <-> E. Construct
G" by adding L2 -> A to G'.

Choke point {B, C} x {D, E} (from Pair 2) cannot co-exist with {A, B} x {D, E}
(from Pair 1) since this entails {A, C} x {D, E}. Moreover, {B, C} x {D, E} cannot co-exist
with {A, B} x {C, E} (also from Pair 1): exchanging B with D in the set
{{A, C} x {B, D}, {A, B} x {C, E}, {B, C} x {D, E}} generates
{{A, C} x {B, D}, {A, D} x {C, E}, {B, E} x {C, D}}. From Lemma 10, this implies all
three tetrads in the covariance of {A, B, C, D}, a contradiction.

By symmetry between A and C, it follows that no two elements of the union of
Pair 1 and Pair 2 can simultaneously exist. Let {X1, B} x {X2, E} be a choke point in the
union of Pair 1 and Pair 2 that is assumed to exist. Construct G' as follows: let L1 and L2
be two latents, where L1 is a parent of A, C and L2, and L2 is a parent of B, D. If X1 = A
and X2 = C, or if X1 = C and X2 = A, let L1 be the parent of E. Otherwise, let L2 be the
parent of E. Add bi-directed edges between E and every element in X\{B, X1}. Construct
G" by adding L2 -> A to G'.

Finally,  if  no  element  in  Pairs  1,  2  or  3  is  assumed  to  exist,  create  G'  and  G"  as 
above,  but  connect  E  to  all  other  elements  of  X  by  bi-directed  edges.  □ 

Lemma 13 Let G(O) be a semilinear latent variable graph. Then, if for $\{A, B, C\} \subset O$
we have $\rho_{AB} = 0$ or $\rho_{AB.C} = 0$, then A and B cannot share a common latent parent in G
with probability 1 with respect to a Lebesgue measure over the coefficient and error variance
parameters.

Proof: Let A, B, C be defined according to the following linear functions

$A = aL + \sum_p a_p A_p + \epsilon_a$

$B = bL + \sum_i b_i B_i + \epsilon_b$

$C = \sum_j c_j C_j + \epsilon_c$

where L is a common latent parent of A and B, $\{A_p\}$ represents parents of A, $\{B_i\}$ are
parents of B, $\{C_j\}$ parents of C, and $\{a_p\} \cup \{b_i\} \cup \{c_j\} \cup \{a, b, \zeta_a, \zeta_b, \zeta_c\}$ are parameters of the
measurement model, $\{\zeta_a, \zeta_b, \zeta_c\}$ being the variances of error terms $\{\epsilon_a, \epsilon_b, \epsilon_c\}$, respectively.

Assume $\sigma_{AB} = 0$. By the equations above, $\sigma_{AB} = ab\sigma^2_L + K$, where no term in K
has a factor ab. For this identity to hold, we therefore need $ab\sigma^2_L = 0$. By assumption, latent
variables have positive variance, so the fact that $ab\sigma^2_L = 0$ implies $\sigma^2_L = 0$ is a contradiction.

Since $\rho_{AB.C} = 0$ if and only if $\sigma_{AB}\sigma^2_C - \sigma_{AC}\sigma_{BC} = 0$ for positive $\sigma^2_C$, assume the latter.

Expressing this polynomial as a function of the given coefficients, we obtain $ab\sigma^2_L\sigma^2_C + Q$.
Since C is not an ancestor of L (because L is latent) no term in $ab\sigma^2_L$ contains the symbol $\zeta_c$,
nor any coefficient $\{c_j\}$. Since every term in $\sigma_{AC}\sigma_{BC}$ that might contain $\zeta_c$ must also contain
some $\{c_j\}$, then no term in $\sigma_{AC}\sigma_{BC}$ can cancel any term in $ab\sigma^2_L\zeta_c$ (which is contained in
$ab\sigma^2_L\sigma^2_C$). This implies $ab\sigma^2_L\zeta_c = 0$, a contradiction. □

Lemma 14 Let G(O) be a latent variable graph. Let $\{A, B, C\} \subset O$ be some triplet such
that A and B have parents $L_1$ and $L_2$, respectively (where it is possible that $L_1 = L_2$), and C
is not an ancestor of A or B. Then, if $\sigma_{L_1L_2} \neq 0$, it follows that $\rho_{AB.C} \neq 0$ with probability
1 with respect to a Lebesgue measure over the coefficient and error variance parameters.

Proof: Let the structural equations for A, B and C be $A = aL_1 + \sum_i a_i A_i + \epsilon_a$,
$B = bL_2 + \sum_j b_j B_j + \epsilon_b$ and $C = \sum_k c_k C_k + \epsilon_c$, where $\epsilon_a$, $\epsilon_b$ and $\epsilon_c$ are independent random
variables, and independent of every other random variable in G besides their respective
descendants.

We have that $\rho_{AB.C} \neq 0 \Leftrightarrow \sigma_{AB}\sigma^2_C - \sigma_{AC}\sigma_{BC} \neq 0$. We will prove that $\sigma_{AB}\sigma^2_C - \sigma_{AC}\sigma_{BC} \neq 0$.
From the above equations, we have that
$\sigma_{AB}\sigma^2_C - \sigma_{AC}\sigma_{BC} = [ab\sigma_{L_1L_2} + F_1(A, B)](F_2(C) + \zeta_c) - \sigma_{AC}\sigma_{BC}$,
where no term in $F_1(A, B)$ can contain the product ab, every term in $F_2(C)$
contains some variable $c_k$ as well as every term in $\sigma_{AC}\sigma_{BC}$, and $\zeta_c$ is the variance of the error
term of C. The term $\sigma_{L_1L_2}$ cannot contain any variable $c_k$, since C is not an ancestor
of A or B. Therefore, no term in this polynomial can cancel the term $ab\sigma_{L_1L_2}\zeta_c$, and since
$ab\sigma_{L_1L_2}\zeta_c \neq 0$, it follows that $\rho_{AB.C} \neq 0$. □
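
Lemmas 13 and 14 both rest on the identity that the partial correlation of A and B given C vanishes exactly when sigma_AB sigma_C^2 - sigma_AC sigma_BC = 0. As a minimal sketch, the function below computes the partial correlation from covariances and compares two hypothetical cases: three noisy indicators of one latent L (the situation Lemma 13 rules out as having a vanishing partial correlation), and the idealized case where C is the common cause itself. The function name and all numbers are assumptions made for the example.

```python
import math

def partial_corr(s_ab, s_ac, s_bc, v_a, v_b, v_c):
    """Partial correlation of A and B given C from covariances and variances."""
    numerator = s_ab * v_c - s_ac * s_bc          # the polynomial used in the proofs
    denominator = math.sqrt((v_a * v_c - s_ac ** 2) * (v_b * v_c - s_bc ** 2))
    return numerator / denominator

a, b, c = 0.9, 0.8, 0.7   # hypothetical loadings on a unit-variance latent L

# Case 1: A, B, C are all noisy children of L, each scaled to unit variance.
rho_children = partial_corr(a * b, a * c, b * c, 1.0, 1.0, 1.0)

# Case 2: C is L itself, so conditioning on C removes the common cause.
rho_common_cause = partial_corr(a * b, a, b, 1.0, 1.0, 1.0)

print(rho_children, rho_common_cause)
```

In the first case the numerator is ab(1 - c^2), which cannot vanish for nonzero loadings and a noisy C; in the second it is ab - ab = 0.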
Lemma 15 Let G(O) be a latent variable graph with latent covariance matrix $\Sigma_L$. For any
set $\{A, B, C, D\} = O' \subset O$, if $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ and for every set $\{X, Y\} \subset O'$,
$Z \in O$ we have $\rho_{XY.Z} \neq 0$ and $\rho_{XY} \neq 0$, then A and B do not have more than one common
parent in G with probability 1 with respect to a Lebesgue measure over the coefficient and
error variance parameters.

Proof: Assume $L_1$ and $L_2$ are two common parents of A and B in G. Let the graph G' have
the same structure as G, but without all edges from other possible parents of A and B not
in $\{L_1, L_2\}$. Since G' is more constrained than G, if a tetrad constraint holds in G, then it
holds in G'. By Lemma 1, no element in O' is an ancestor of any other element in this set.
Let the structural equations for A, B, C and D in G' be:

$A = \alpha_1 L_1 + \alpha_2 L_2$

$B = \beta_1 L_1 + \beta_2 L_2$

$C = \sum_j c_j C_j$

$D = \sum_k d_k D_k$

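Consider only the choice of coefficient and error variances by which the given constraint
is entailed by G and all latent covariance matrices. As argued in previous lemmas, we know
this happens with probability 1. Since the tetrad constraint $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD}$ is entailed
by G', we have

$\sigma_{AB}\sigma_{CD} - \sigma_{AC}\sigma_{BD} = 0$

$\Rightarrow (\alpha_1\beta_1\sigma^2_{L_1} + \alpha_1\beta_2\sigma_{L_1L_2} + \alpha_2\beta_1\sigma_{L_1L_2} + \alpha_2\beta_2\sigma^2_{L_2})\sigma_{CD} - (\alpha_1 \sum_j c_j\sigma_{C_jL_1} + \alpha_2 \sum_j c_j\sigma_{C_jL_2})(\beta_1 \sum_k d_k\sigma_{D_kL_1} + \beta_2 \sum_k d_k\sigma_{D_kL_2}) = 0$

$\Rightarrow \alpha_1\beta_1(\sigma^2_{L_1}\sigma_{CD} - \sum_j c_j\sigma_{C_jL_1} \sum_k d_k\sigma_{D_kL_1}) + f(G) = 0$,

where

$f(G) = (\alpha_1\beta_2\sigma_{L_1L_2} + \alpha_2\beta_1\sigma_{L_1L_2} + \alpha_2\beta_2\sigma^2_{L_2})\sigma_{CD} - \alpha_1\beta_2 \sum_j c_j\sigma_{C_jL_1} \sum_k d_k\sigma_{D_kL_2} - \alpha_2 \sum_j c_j\sigma_{C_jL_2}(\beta_1 \sum_k d_k\sigma_{D_kL_1} + \beta_2 \sum_k d_k\sigma_{D_kL_2})$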

When fully expanding f(G) as a function of the linear parameters of G, the product $\alpha_1\beta_1$
cannot possibly appear, since no element in O' is an ancestor of any other element in this
set. Therefore, since the polynomial constraint is identically zero and nothing in f(G) can
cancel the term $\alpha_1\beta_1$, we have:

$\sigma^2_{L_1}\sigma_{CD} = \sum_j c_j\sigma_{C_jL_1} \sum_k d_k\sigma_{D_kL_1}$ (10)

Using a similar argument for the coefficients of $\alpha_1\beta_2$, $\alpha_2\beta_1$ and $\alpha_2\beta_2$, we get:

$\sigma_{L_1L_2}\sigma_{CD} = \sum_j c_j\sigma_{C_jL_1} \sum_k d_k\sigma_{D_kL_2}$ (11)

$\sigma_{L_1L_2}\sigma_{CD} = \sum_j c_j\sigma_{C_jL_2} \sum_k d_k\sigma_{D_kL_1}$ (12)

$\sigma^2_{L_2}\sigma_{CD} = \sum_j c_j\sigma_{C_jL_2} \sum_k d_k\sigma_{D_kL_2}$ (13)


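From (10), (11), (12), (13), it follows:

$\sigma_{AC}\sigma_{AD} = [\alpha_1 \sum_j c_j\sigma_{C_jL_1} + \alpha_2 \sum_j c_j\sigma_{C_jL_2}][\alpha_1 \sum_k d_k\sigma_{D_kL_1} + \alpha_2 \sum_k d_k\sigma_{D_kL_2}]$

$= \alpha_1^2 \sum_j c_j\sigma_{C_jL_1} \sum_k d_k\sigma_{D_kL_1} + \alpha_1\alpha_2 \sum_j c_j\sigma_{C_jL_1} \sum_k d_k\sigma_{D_kL_2} + \alpha_1\alpha_2 \sum_j c_j\sigma_{C_jL_2} \sum_k d_k\sigma_{D_kL_1} + \alpha_2^2 \sum_j c_j\sigma_{C_jL_2} \sum_k d_k\sigma_{D_kL_2}$

$= [\alpha_1^2\sigma^2_{L_1} + 2\alpha_1\alpha_2\sigma_{L_1L_2} + \alpha_2^2\sigma^2_{L_2}]\sigma_{CD}$

$= \sigma^2_A\sigma_{CD}$

which implies $\sigma_{CD} - \sigma_{AC}\sigma_{AD}(\sigma^2_A)^{-1} = 0 \Rightarrow \rho_{CD.A} = 0$. By Lemma 14, C and D have no
correlated parents, which entails $\sigma_{CD} = 0$ in G'. Since all treks between C and D in G are
preserved in G', that implies $\sigma_{CD} = 0$ is entailed by G. Contradiction. □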
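
The content of Lemma 15 can be illustrated numerically. The sketch below is not part of the proof: it assumes a small hypothetical model with two unit-variance latents of covariance g, where C is a child of L1 and D a child of L2, and compares the three tetrad products when A and B share one latent parent with the same products when they share both. All loadings are made-up values.

```python
def tetrad_products(alpha, beta, c, d, g):
    """Return (s_AB*s_CD, s_AC*s_BD, s_AD*s_BC) for a two-latent model.

    alpha, beta: loadings of A and B on (L1, L2); C = c*L1 + noise and
    D = d*L2 + noise; the latents have unit variance and covariance g.
    Error terms are independent, so they do not enter the covariances.
    """
    latent = ((1.0, g), (g, 1.0))

    def cov(u, v):
        return sum(u[i] * latent[i][j] * v[j] for i in range(2) for j in range(2))

    A, B, C, D = alpha, beta, (c, 0.0), (0.0, d)
    return (cov(A, B) * cov(C, D), cov(A, C) * cov(B, D), cov(A, D) * cov(B, C))

# A and B share only L1: the three products coincide.
one_parent = tetrad_products((0.9, 0.0), (0.5, 0.0), c=0.8, d=0.6, g=0.5)

# A and B share both L1 and L2: the products differ for these values.
two_parents = tetrad_products((0.9, 0.4), (0.5, 0.7), c=0.8, d=0.6, g=0.5)

print(one_parent)
print(two_parents)
```

The second case shows only that the three tetrads fail for one parameter choice; the lemma's probability-1 statement is what rules out accidental equality in general.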

Lemma 16 Let G(O) be a latent variable graph with latent covariance matrix $\Sigma_L$. For any
set $\{A, B, C, D\} = O' \subset O$, if $\sigma_{AB}\sigma_{CD} = \sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ and for every set $\{X, Y\} \subset O'$,
$Z \in O$ we have $\rho_{XY.Z} \neq 0$ and $\rho_{XY} \neq 0$, then if A and B have a common latent parent
$L_1$ in G, B and C have a common latent parent $L_2$ in G, we have $L_1 = L_2$ with probability
1 with respect to a Lebesgue measure over the coefficient and error variance parameters.

Proof: Assume A, B and C are parameterized as follows:

$A = aL_1 + \sum_p a_p A_p$

$B = b_1 L_1 + b_2 L_2 + \sum_i b_i B_i$

$C = cL_2 + \sum_j c_j C_j$

where as before $\{A_p\} \cup \{B_i\} \cup \{C_j\}$ represents the possible other parents of A, B and C,
respectively. Assume $L_1 \neq L_2$. We will show that $\rho_{L_1L_2} = 1$, which is a contradiction. From
the given tetrad constraint $\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$, and the fact that from Lemma 1 we have
that for no pair $\{X, Y\} \subset O'$ X is an ancestor of Y, if we factorize the constraint according
to which terms include $ab_1c$ as a factor, we obtain with probability 1:

$ab_1c[\sigma^2_{L_1}\sigma_{L_2D} - \sigma_{L_1D}\sigma_{L_1L_2}] = 0$ (14)

If we factorize such constraint according to $ab_2c$, it follows:

$ab_2c[\sigma_{L_1L_2}\sigma_{L_2D} - \sigma_{L_1D}\sigma^2_{L_2}] = 0$ (15)

From (14) and (15), it follows that $\sigma^2_{L_1}\sigma^2_{L_2} = (\sigma_{L_1L_2})^2 \Rightarrow \rho_{L_1L_2} = 1$. Contradiction. □
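
The last step can be spelled out. The following is a sketch of the algebra, reading (14) and (15) as the statements that the two bracketed factors vanish (the coefficients a, b1, b2 and c being nonzero) and assuming additionally that $\sigma_{L_1D}$ and $\sigma_{L_2D}$ are nonzero; that last assumption belongs to this sketch and is not stated explicitly in the proof.

```latex
\begin{align*}
\sigma^2_{L_1}\,\sigma_{L_2 D} &= \sigma_{L_1 D}\,\sigma_{L_1 L_2}
  && \text{from (14)} \\
\sigma_{L_1 L_2}\,\sigma_{L_2 D} &= \sigma_{L_1 D}\,\sigma^2_{L_2}
  && \text{from (15)} \\
\sigma^2_{L_1}\,\sigma^2_{L_2}\,\sigma_{L_1 D}\,\sigma_{L_2 D}
  &= \sigma^2_{L_1 L_2}\,\sigma_{L_1 D}\,\sigma_{L_2 D}
  && \text{cross-multiplying the two lines} \\
\sigma^2_{L_1}\,\sigma^2_{L_2} &= \sigma^2_{L_1 L_2}
  && \text{dividing by } \sigma_{L_1 D}\,\sigma_{L_2 D} \neq 0 \\
\rho^2_{L_1 L_2} &= \frac{\sigma^2_{L_1 L_2}}{\sigma^2_{L_1}\,\sigma^2_{L_2}} = 1
\end{align*}
```

A squared correlation of 1 means the two latents are perfectly linearly dependent, which is the contradiction the proof invokes.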

Theorem 4 The output of FindPattern is a generalized measurement pattern with respect
to the tetrad and vanishing partial correlation constraints of $\Sigma$ with probability 1.

Proof:  Two  nodes  will  not  share  a  common  latent  parent  in  a  measurement  pattern  if  and 
only  if  they  are  not  linked  by  an  edge  in  graph  C  constructed  by  algorithm  FindPattern 
and  that  happens  if  and  only  if  some  partial  correlation  vanishes  or  if  any  of  rules  CS1,  CS2 
or  CS3  holds.  But  then  by  Lemmas  3,  5,  6  and  13  the  claim  is  proved.  The  claim  about 
undirected  edges  follows  directly  from  Lemma  1.  □ 
Theorem 5 Let G(O) be a latent variable graph. Then the output of BuildPureClusters
is a valid l-interpretation for G in the family of tetrad and vanishing partial correlation
constraints and a pure generalized measurement pattern.

Proof:  The  output  is  a  pure  measurement  model  and  generalized  measurement  pattern  by 
construction:  each  node  has  only  one  latent  parent,  and  there  are  no  edges  linking  observed 
nodes.  We  only  have  to  show  that  all  tetrad  constraints  entailed  by  such  measurement  model 
also  hold  in  the  population  covariance  matrix. 

Let {A, B, C, D} be four observed nodes. If {A, B, C} belong to the same latent parent,
then all tetrad constraints will be entailed by a pure measurement model with respect to a
fourth node D, and by Step 5 of Table 2, this will be guaranteed. Now suppose {A, B} have
the same latent parent, while C and D are children of other parents (where C and D might
have the same parent). Then the tetrad $\sigma_{AC}\sigma_{BD} = \sigma_{AD}\sigma_{BC}$ will be entailed, and this will
always hold in the covariance matrix, by Step 6 of Table 2.

The tetrad $\sigma_{AB}\sigma_{CD} = \sigma_{AD}\sigma_{BC}$ will not be entailed: if $L_1$ is the parent of A and B, $L_2$ is
the parent of C and $L_3$ is the parent of D, this will require $\rho_{L_2L_3.L_1} = 0$, which will hold only
in some latent covariance matrices, contrary to the definition of entailment in measurement
models. Similarly, if no two elements in {A, B, C, D} share a common parent in the output,
then no tetrad will be entailed in this set except for specific latent covariance matrices. □
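
The distinction between a tetrad that is entailed and one that holds only for particular latent covariance matrices can be seen in a small numerical sketch. It assumes a hypothetical pure measurement model in which A and B are children of L1, C is a child of L2 and D is a child of L3; the loadings and both latent covariance matrices are made-up values.

```python
def tetrad_differences(a, b, c, d, latent):
    """Differences for a pure model: A, B on L1; C on L2; D on L3.

    Returns (s_AC*s_BD - s_AD*s_BC, s_AB*s_CD - s_AD*s_BC), where `latent`
    is the 3x3 covariance matrix of (L1, L2, L3).
    """
    s_ab = a * b * latent[0][0]
    s_ac, s_ad = a * c * latent[0][1], a * d * latent[0][2]
    s_bc, s_bd = b * c * latent[0][1], b * d * latent[0][2]
    s_cd = c * d * latent[1][2]
    return s_ac * s_bd - s_ad * s_bc, s_ab * s_cd - s_ad * s_bc

loadings = (0.9, 0.8, 0.7, 0.6)

# A generic latent covariance matrix: L2 and L3 stay correlated given L1.
generic = [[1.0, 0.5, 0.4],
           [0.5, 1.0, 0.6],
           [0.4, 0.6, 1.0]]

# A special matrix with cov(L2, L3) = cov(L1, L2) * cov(L1, L3) / var(L1),
# i.e. the partial correlation of L2 and L3 given L1 is zero.
special = [[1.0, 0.5, 0.4],
           [0.5, 1.0, 0.5 * 0.4],
           [0.4, 0.5 * 0.4, 1.0]]

entailed_generic, other_generic = tetrad_differences(*loadings, generic)
entailed_special, other_special = tetrad_differences(*loadings, special)
print(entailed_generic, other_generic)
print(entailed_special, other_special)
```

The first difference is zero for both matrices, while the second vanishes only for the special one, which is why it does not count as entailed by the measurement model.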

Corollary 1 Let G(O) be a latent variable graph. Then the output of BuildPureClusters
is a l-interpretation for G in the family of tetrad and vanishing partial correlation
constraints even when rules CS1, CS2 and CS3 are applied an arbitrary number of times in
FindPattern for any arbitrary subset of nodes and an arbitrary number of maximal cliques
is found.

Proof:  Independently  of  the  choice  made  on  Step  2  of  BuildPureClusters,  by  the  end 
of  Step  4  we  will  meet  all  the  conditions  used  to  prove  Theorem  5:  that  nodes  in  different 
clusters  cannot  share  a  same  parent  nor  be  ancestors  of  each  other.  The  rest  follows  directly 
from  the  proof  of  Theorem  5.  □ 

B  The  spiritual  coping  questionnaire 

The following questionnaire is provided to facilitate understanding of the religious/spiritual
coping example given in Section 7.2. It can also serve as an example of how questionnaires
are actually designed.

Section  I  This  section  intends  to  measure  the  level  of  stress  of  the  subject.  In  the  actual 
questionnaire,  it  starts  with  the  following  instructions: 

Circle the number next to each item to indicate how stressful each of these events has been
for you since entered your graduate program. If you have never experienced one of the events
listed below, then circle number 1. If one of the events listed below has happened to you and
has caused you a great deal of stress, rate that event toward the “Extremely Stressful” end
of the rating scale. If an event has happened to you while you have been in graduate school,
but has not bothered you at all, rate that event toward the lower end of the scale (“Not at all
Stressful”).

The  student  then  chooses  the  level  of  stress  by  circling  a  number  on  a  7  point  scale.  The 
questions  of  this  section  are: 

1.  Fulfilling  responsibilities  both  at  home  and  at  school 

2.  Trying  to  meet  peers  of  your  race/ethnicity  on  campus 

3.  Taking  exams 

4.  Being  obligated  to  participate  in  family  functions 

5.  Arranging  childcare 

6.  Finding  support  groups  sensitive  to  your  needs 

7.  Fear  of  failing  to  meet  program  expectations 

8.  Participating  in  class 

9.  Meeting  with  faculty 

10.  Living  in  the  local  community 

11.  Handling  relationships 

12.  Handling  the  academic  workload 

13.  Peers  treating  you  unlike  the  way  they  treat  each  other 

14.  Faculty  treating  you  differently  than  your  peers 

15.  Writing  papers 

16.  Paying  monthly  expenses 

17.  Family  having  money  problems 

18.  Adjusting  to  the  campus  environment 

19.  Being  obligated  to  repay  loans 

20.  Anticipation  of  finding  full-time  professional  work 

21.  Meeting  deadlines  for  course  assignments 

Section  II  This  section  intends  to  measure  the  level  of  depression  of  the  subject.  In  the 
actual  questionnaire,  it  starts  with  the  following  instructions: 

Below  is  a  list  of  the  ways  you  might  have  felt  or  behaved.  Please  tell  me  how  often  you  have 
felt  this  way  during  the  past  week. 

The  student  then  chooses  the  level  of  frequency  that  some  events  happened  to  him/her  by 
circling  a  number  on  a  4  point  scale.  The  scale  is  “Rarely  or  None  of  the  Time  (less  than  1 
day)”,  “Some  or  Little  of  the  Time  (1-2  days)”,  “Occasionally  or  a  Moderate  Amount  of 
the  Time  (3-4  days)”  and  “Most  or  All  of  the  Time  (5-7  days)”.  The  events  are  as  follows: 
1. I was bothered by things that usually don’t bother me

2.  I  did  not  feel  like  eating;  my  appetite  was  poor 

3.  I  felt  that  I  could  not  shake  off  the  blues  even  with  help  from  my  family  or  friends 

4.  I  felt  that  I  was  just  as  good  as  other  people 

5.  I  had  trouble  keeping  my  mind  on  what  I  was  doing 

6.  I  felt  depressed 

7.  I  felt  that  everything  I  did  was  an  effort 

8.  I  felt  hopeful  about  the  future 

9.  I  thought  my  life  had  been  a  failure 

10.  I  felt  fearful 

11.  My  sleep  was  restless 

12.  I  was  happy 

13.  I  talked  less  than  usual 

14.  I  felt  lonely 

15.  People  were  unfriendly 

16.  I  enjoyed  life 

17.  I  had  crying  spells 

18.  I  felt  sad 

19.  I  felt  that  people  disliked  me 

20.  I  could  not  get  “going” 

Section  III  This  section  intends  to  measure  the  level  of  spiritual  coping  of  the  subject.  In 
the  actual  questionnaire,  it  starts  with  the  following  instructions: 

Please  think  about  how  you  try  to  understand  and  deal  with  major  problems  in  your  life. 
These  items  ask  what  you  did  to  cope  with  your  negative  event.  Each  item  says  something 
about  a  particular  way  of  coping.  To  what  extent  is  your  religion  or  higher  power  involved 
in  the  way  you  cope? 

The  student  then  chooses  the  level  of  importance  of  some  spiritual  guideline  by  circling  a 
number  on  a  4  point  scale.  The  scale  is  “Not  at  all”,  “Somewhat”,  “Quite  a  bit”,  “A  great 
deal”. The guidelines are:

1.  I  think  about  how  my  life  is  part  of  a  larger  spiritual  force 

2.  I  work  together  with  God  (high  power)  as  partners  to  get  through  hard  times 

3.  I  look  to  God  (high  power)  for  strength,  support,  and  guidance  in  crises 

4.  I  try  to  find  the  lesson  from  God  (high  power)  in  crises 
5. I confess my sins and ask for God (high power)’s forgiveness

6.  I  feel  that  stressful  situations  are  God  (high  power)’s  way  of  punishing  me  for  my  sins 
or  lack  of  spirituality 

7.  I  wonder  whether  God  has  abandoned  me 

8.  I  try  to  make  sense  of  the  situation  and  decide  what  to  do  without  relying  on  God 
(high  power) 

9.  I  question  whether  God  (high  power)  really  exists 

10.  I  express  anger  at  God  (high  power)  for  letting  terrible  things  happen 

11. I do what I can and put the rest in God (high power)’s hands

12.  I  do  not  try  much  of  anything;  simply  expect  God  (high  power)  to  take  my  worries 
away 

13.  I  pray  for  a  miracle 

14.  I  pray  to  get  my  mind  off  of  my  problems 

15.  I  ignore  advice  that  is  inconsistent  with  my  faith 

16.  I  look  for  spiritual  support  from  clergy 

17.  I  disagree  with  what  my  religion  wants  me  to  do  or  believe 

18.  I  ask  God  (high  power)  to  help  me  find  a  new  purpose  in  life 

19.  I  try  to  find  a  completely  new  life  through  religion 

20.  I  seek  help  from  God  (high  power)  in  letting  go  of  my  anger 